
6 posts tagged with "devops"


· 4 min read
Idriss Neumann

In a previous blogpost we explained how we reduced our observability bill using Quickwit thanks to its ability to store the logs and traces using object storage:

quickwit-architecture

We also said that we were using VictoriaMetrics to store our metrics but weren't satisfied with its lack of object storage support.

We always wanted to store all our telemetry, including the metrics, on object storage but weren't convinced by Thanos or Mimir, which still rely on Prometheus to work, making them very slow.

The thing is, for all of cwcloud's metrics we're using the OpenMetrics format with a /v1/metrics endpoint, like most modern observable applications following the state of the art in observability.

Moreover, all of our relevant metrics are gauges and counters, and what we need is to set up Grafana dashboards and alerts which look like this:

grafana-trafic-light-dashboard

In fact, we discovered that it's perfectly feasible to set up the different thresholds and build some Grafana visualizations based on simple aggregations (average, sum, min/max, percentiles) using Quickwit's datasource:

grafana-trafic-light-visualization

However, if you're also used to searching and filtering metrics using PromQL in the metrics explorer, you'll have to adapt your habits and use Lucene queries instead:

grafana-quickwit-metrics-explorer

As you can see, it's not a big deal ;-p
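To make the Lucene-style querying a bit more concrete, here is a minimal sketch of a request body for Quickwit's search endpoint (POST /api/v1/prom-metrics-v0.1/search), combining a query on the metric name with an average per time bucket. It assumes the index used in this post (fields name, timestamp and gauge.value); the metric name and the helper function are hypothetical, and this is an illustration rather than the exact query behind our dashboards.

```python
import json


def build_avg_query(metric_name, interval="1m"):
    """Build a Quickwit search body averaging a gauge over time buckets."""
    return {
        # Lucene-like query: exact match on the raw-tokenized "name" field
        "query": f"name:{metric_name}",
        # We only want the aggregation results, not the matching documents
        "max_hits": 0,
        "aggs": {
            "over_time": {
                "date_histogram": {
                    "field": "timestamp",
                    "fixed_interval": interval,
                },
                # Average of the gauge value inside each time bucket
                "aggs": {"avg_value": {"avg": {"field": "gauge.value"}}},
            }
        },
    }


# "http_requests_in_flight" is a made-up metric name for the example
body = build_avg_query("http_requests_in_flight")
print(json.dumps(body, indent=2))
```

The Grafana datasource builds this kind of request for you; the sketch only shows that a metric panel boils down to a text query plus an aggregation.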

That being said, in order to scrape and ingest the Prometheus/OpenMetrics HTTP endpoints, we chose to use Vector [1] with this configuration:

sources:
  prom_app_1:
    type: "prometheus_scrape"
    endpoints:
      - "https://cloud-api.comwork.io/v1/metrics"

transforms:
  remap_prom_app_1:
    inputs: ["prom_app_1"]
    type: "remap"
    source: |
      if is_null(.tags) {
        .tags = {}
      }

      .tags.source = "prom_app_1"

sinks:
  quickwit_app_1:
    type: "http"
    method: "post"
    inputs: ["remap_prom_app_1"]
    encoding:
      codec: "json"
    framing:
      method: "newline_delimited"
    uri: "http://quickwit-searcher.your_ns.svc.cluster.local:7280/api/v1/prom-metrics-v0.1/ingest"

Note: unlike with other sources such as kubernetes_logs or docker_logs, you cannot transform the payload structure the way you want, but you can add some tags to give a bit of context. That's what we did in this example by adding a source field inside the tags object.
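As a rough illustration of what this remap does, here is the same logic sketched in Python on a sample event. The event shape (name, kind, timestamp, and a counter or gauge object) follows what the mapping in this post expects; the sample values and the function name are made up for the example.

```python
import copy


def add_source_tag(event, source):
    """Mimic the remap transform: make sure `tags` exists, then set tags.source."""
    out = copy.deepcopy(event)  # leave the caller's event untouched
    if out.get("tags") is None:
        out["tags"] = {}
    out["tags"]["source"] = source
    return out


# Hypothetical metric event, shaped like the JSON sent to the sink
sample = {
    "name": "http_requests_total",
    "kind": "absolute",
    "timestamp": "2024-01-01T00:00:00Z",
    "counter": {"value": 42.0},
}

tagged = add_source_tag(sample, "prom_app_1")
print(tagged["tags"])
```

The rest of the event is left as it is, which matches the limitation described in the note: the structure stays the same and only the tags are enriched.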

And this is the JSON mapping that matches the Vector output sent to the sink and lets you run aggregations on the numeric values:

{
  "doc_mapping": {
    "mode": "dynamic",
    "field_mappings": [
      {
        "name": "timestamp",
        "type": "datetime",
        "fast": true,
        "fast_precision": "seconds",
        "indexed": true,
        "input_formats": [
          "rfc3339",
          "unix_timestamp"
        ],
        "output_format": "unix_timestamp_nanos",
        "stored": true
      },
      {
        "indexed": true,
        "fast": true,
        "name": "name",
        "type": "text",
        "tokenizer": "raw"
      },
      {
        "indexed": true,
        "fast": true,
        "name": "kind",
        "type": "text",
        "tokenizer": "raw"
      },
      {
        "name": "tags",
        "type": "json",
        "fast": true,
        "indexed": true,
        "record": "basic",
        "stored": true,
        "tokenizer": "default"
      },
      {
        "name": "gauge",
        "type": "object",
        "field_mappings": [
          {
            "name": "value",
            "fast": true,
            "indexed": true,
            "type": "f64"
          }
        ]
      },
      {
        "name": "counter",
        "type": "object",
        "field_mappings": [
          {
            "name": "value",
            "fast": true,
            "indexed": true,
            "type": "f64"
          }
        ]
      },
      {
        "name": "aggregated_summary",
        "type": "object",
        "field_mappings": [
          {
            "name": "sum",
            "fast": true,
            "indexed": true,
            "type": "f64"
          },
          {
            "name": "count",
            "fast": true,
            "indexed": true,
            "type": "u64"
          }
        ]
      },
      {
        "name": "aggregated_histogram",
        "type": "object",
        "field_mappings": [
          {
            "name": "sum",
            "fast": true,
            "indexed": true,
            "type": "f64"
          },
          {
            "name": "count",
            "fast": true,
            "indexed": true,
            "type": "u64"
          }
        ]
      }
    ],
    "timestamp_field": "timestamp",
    "max_num_partitions": 200,
    "index_field_presence": true,
    "store_source": false,
    "tokenizers": []
  },
  "index_id": "prom-metrics-v0.1",
  "search_settings": {
    "default_search_fields": [
      "name",
      "kind"
    ]
  },
  "version": "0.8"
}
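For reference, such an index configuration can be submitted to Quickwit's REST API with a POST on /api/v1/indexes. Here is a minimal sketch that only prepares the request; the base URL and the shortened config are placeholders, and in practice you would load the full JSON above from a file.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_config):
    """Prepare (but do not send) the request that creates a Quickwit index."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/indexes",
        data=json.dumps(index_config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Shortened config for the example; use the complete mapping in real life
config = {
    "version": "0.8",
    "index_id": "prom-metrics-v0.1",
    "doc_mapping": {"mode": "dynamic"},
}

# "http://localhost:7280" is an assumed local Quickwit address
req = build_create_index_request("http://localhost:7280", config)
print(req.get_method(), req.full_url)

# To actually create the index:
# urllib.request.urlopen(req)
```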

To conclude, despite the fact that Quickwit isn't a real TSDB [2] (time-series database), we found it pretty easy to still use it as a metrics backend with Vector. This way we can still tell our developers to rely on the OpenMetrics/Prometheus SDK to expose the metrics routes to scrape. However, we're still encouraging some of our customers to use VictoriaMetrics because this approach is still experimental and some of them need more sophisticated computation capabilities [3].
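To give an idea of the kind of computation that simple aggregations don't cover, here is a simplified sketch of what a range function like rate does on a counter: a per-second increase that survives counter resets. It's a deliberately naive version (PromQL's rate also extrapolates at the window boundaries), and the sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Handles counter resets; unlike PromQL's rate(), it does no extrapolation.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# The counter goes 100 -> 130, then the process restarts and reaches 10
print(simple_rate([(0, 100.0), (30, 130.0), (60, 10.0)]))
```

An average or a sum over the raw values would give a meaningless result here, which is why customers relying on this kind of query are better served by a real TSDB for now.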

One of the improvements we immediately thought about would be to also implement OpenTelemetry compatibility in order to be able to push metrics through the OTLP/gRPC protocol. We opened an issue with the Quickwit team to submit this idea, but we think it can also be done using Vector.


  1. to get more details on the prometheus_scrape input, you can rely on this documentation
  2. at the time of writing, because we know that the Quickwit team plans to provide a real TSDB engine at some point
  3. for example, using multiple metrics in one PromQL query, using the range functions such as rate or irate...

· 8 min read
Idriss Neumann

Over the last decade, you've probably heard about serverless architecture or Function as a Service (FaaS) many times. But you might also have heard the word "serverless" used for other cloud services such as Database as a Service (DBaaS) or Container as a Service (CaaS).

What do those things have in common to be called "serverless"? In the beginning, this word implied two conditions that I'll recall at the start of this blogpost. Then I'll focus on FaaS and explain why I think it has evolved over the last couple of years.

The first condition is you ain't supposed to know about the infrastructure that hosts the service you're using.

  • For a DBaaS, you just get an endpoint to connect your apps to and don't have to worry about cluster sizing, scaling, hardware capabilities...
  • For a CaaS, you just have to tell a simple API which container image and tag to deploy and don't have to worry about the clustering of your container orchestrators. The CaaS might be built on top of Kubernetes (or K8S) with knative, and the K8S API with knative's CRDs (Custom Resource Definitions) can be considered some sort of serverless API if you don't have to worry about the K8S cluster running behind it
  • For a FaaS, you just have to implement a function in a supported programming language and don't have to worry about how this function will be built as a microservice [1], exposed as a web service and triggered by multiple events [2]
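To picture the FaaS case, here is a minimal sketch in Python. The handle function is all the developer would write; the invoke function stands in for what the platform does around it (decoding the event, calling the function, encoding the result). Both names and the payload shape are hypothetical and not tied to any particular FaaS engine.

```python
import json


def handle(event):
    """Business logic only: compute an order total from the event payload."""
    return {"total": sum(item["price"] * item["qty"] for item in event["items"])}


def invoke(function, raw_body):
    """What the platform does around the function: decode, call, encode."""
    return json.dumps(function(json.loads(raw_body)))


# The same call could be triggered by an HTTP request, a queue message or a cron event
response = invoke(handle, '{"items": [{"price": 2.5, "qty": 4}]}')
print(response)
```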

The second condition is the "pay as you go" kind of billing on public cloud: you ain't supposed to pay for dedicated clusters but only for the network, compute [3] and storage used during the runtime of your code or transactions.

For example, with a serverless database you should get billed only for the data you ingest or fetch and the queries you run, not for an entire running cluster. Likewise, with a CaaS or FaaS you should only get billed for the runtime of your containers or the compute and network used during a function call.
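As a worked example of this billing model, here is a small sketch that estimates a monthly FaaS bill from usage only. The formula (compute billed per GB-second plus a fee per million requests) is a common pattern, but all the prices and volumes below are hypothetical and don't come from any provider's price list.

```python
def faas_cost(invocations, avg_duration_s, memory_gb,
              price_per_gb_s, price_per_million_requests):
    """Estimate a pay-as-you-go bill: compute time plus a per-request fee."""
    compute = invocations * avg_duration_s * memory_gb * price_per_gb_s
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


# Hypothetical month: 2 million calls of 200 ms each with 512 MB of RAM
monthly = faas_cost(
    invocations=2_000_000,
    avg_duration_s=0.2,
    memory_gb=0.5,
    price_per_gb_s=0.00002,           # made-up price
    price_per_million_requests=0.20,  # made-up price
)
print(round(monthly, 2))
```

With zero invocations the bill is zero, which is precisely the difference with paying for a cluster that runs all month.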

Here are some well-known examples of serverless offerings from the big cloud players that you might have heard about:

  • AWS Lambda, the very well-known FaaS engine from Amazon, which has kind of set the developer experience of FaaS in my opinion
  • GCP Cloud Run, which is a CaaS built on top of K8S and knative
  • GCP Cloud Functions, the FaaS engine of GCP, built on top of Cloud Run [4]
  • Azure Functions, the FaaS engine of Microsoft Azure

Moreover, the GCP approach of building everything on top of K8S with knative leads the way for other cloud providers to provide similar experiences. It's the case for Scaleway which is also providing a CaaS and a FaaS built on top of knative.

That being said, I think the key feature of serverless, and especially of Function as a Service, isn't the "pay as you go" billing. It's more about adding an abstraction layer over the infrastructure, allowing developers to ship their code more quickly and focus only on the business logic. That's why there are also FaaS engines you can install on premises, such as OpenFaaS or our own cwcloud FaaS engine.

That's also something the industry has been looking for for decades, with tons of tools you might have encountered:

  • BPM (Business Process Management)
  • ETL (Extract Transform Load)
  • CI/CD (Continuous Integration / Continuous Deployment) pipeline orchestrators
  • Workflow engines such as Airflow, Temporal, Cadence, Apache NiFi...
  • API backend frameworks: Spring, Laravel, FastAPI... to lower the complexity of exposing your code as an API or microservices
  • Nocode / Low code
  • etc

Those tools are different and meet different needs for different populations of IT workers, for example:

  • developers who want to focus only on the business logic and not on how to expose this business logic as a service
  • data scientists who need ETL or data pipelines
  • electronics engineers and IoT makers who need to push notifications from their sensors and trigger some processing on their devices, and who enjoy doing it with a lowcode editor [5]
  • product owners technical enough to use BPM, nocode or lowcode to translate their needs
  • system administrators who need to collect and transform some logs for observability purposes or schedule some tasks
  • SREs (Site Reliability Engineers) who need to set up CI/CD pipelines

However, they do have something in common: all those tools generate functions (which are sometimes called "workflow" or "job" or "pipeline" or whatever) that require some compute capabilities and an orchestrator to trigger and launch them. Moreover, those tools are designed to get rid of as many technical aspects as possible and make IT workers focus only on the business aspects. Sounds like the promise of serverless, doesn't it?

Because most of those tools still bring their own compute orchestrator nowadays, maintenance can get very expensive. Lots of companies which recruit multiple kinds of IT workers for their different needs find themselves installing all those solutions in their infrastructures, which requires dozens of SREs to handle this heavy maintenance. I used to work with scale-ups asking to install all the tools I mentioned in this blogpost in K8S. It means installing dozens of job orchestrators on a job orchestrator (because K8S is also a job and pipeline orchestrator). This is ironic, isn't it?

ironic-meme

There are modern tools, mainly in the CI/CD area, which are designed to work on top of K8S in a gitops and serverless way. By that I mean re-using the K8S capabilities to orchestrate ephemeral tasks or even applications. It's the case of knative of course, but also of Tekton or Argo Workflows, which are pretty similar tools allowing us to define serverless pipelines or workflows without having to install runners or a particular runtime, unlike most other CI/CD tools.

However, most of the other kinds of tools I mentioned earlier require installing their own orchestrator engine and reserving a lot of resources in advance in order to be able to trigger their tasks, and that ain't serverless friendly. It's the case for Talend, Airflow, Cadence, GitLab or GitHub runners, etc. We still have to work with those tools because they've not been completely replaced by FaaS engines, even if we can notice that some cloud providers are trying to provide multiple services built on top of one [6].

That's why we decided with CWCloud to implement a single FaaS engine which aims to bring several "dev XP" (developer experiences) for those different populations of IT workers and which is agnostic of the infrastructure running it [7].

It's only the beginning but we already provide:

  • A code editor supporting the following programming languages: Python, Go, JavaScript and even Bash
  • A lowcode editor supporting Blockly which is suitable for IoT makers, lowcode developers and product owners

faas-lowcode-editor

  • An API and CLI to be able to templatize the function's creation

faas-cli

Therefore, the created functions can be exposed as:

  • HTTPS endpoints like a RESTful API
  • Async workers which can be triggered by different kinds of events: scheduler, cron expressions, etc.

Finally, you can choose to invoke the function and wait for the result in the HTTP response in a blocking way (we discourage it but sometimes you ain't got no choice), or set async callbacks. We're supporting the following callbacks:

  • HTTP webhook
  • MQTT or WSS (websockets) queues which are very suitable for IoT makers as well
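The difference between the blocking invocation and the callback-based one can be sketched like this. It's a toy model using a thread and a plain Python callable as the callback; in a real platform the callback would be an HTTP webhook call or a message published on an MQTT or WSS queue, and none of these function names come from the cwcloud API.

```python
import threading


def invoke_sync(function, payload):
    """Blocking mode: the caller waits for the result."""
    return function(payload)


def invoke_async(function, payload, callback):
    """Async mode: return immediately, deliver the result to the callback later."""
    def worker():
        callback(function(payload))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def double(payload):
    return payload["x"] * 2


results = []
thread = invoke_async(double, {"x": 21}, results.append)
thread.join()  # only needed here so the example can print the result
print(invoke_sync(double, {"x": 21}), results)
```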

This video tutorial might give you an idea of the current dev XP:

faas-tutorial-player

To conclude, I believe that all those tools are the very definition of the "framework" concept for all these IT worker populations, in the sense that they allow them to focus on their business logic. Frameworks have allowed companies to produce more and faster, involving more people and reusing more resources, which also had the effect of increasing the quality of IT systems. That's why I strongly believe that FaaS is the new generation of modern frameworks.


  1. It can be an OCI image, a WASM binary...
  2. HTTP calls on a webhook, messages on queues with a message bus or broker system such as Kafka or NATS, cron/scheduler events, etc.
  3. RAM, CPU, etc.
  4. Yeah, cloud services are often built on top of cloud services. For example, a FaaS is often built on top of a CaaS, which is built on top of an IaaS (Infrastructure as a Service)
  5. We can observe that lots of IoT companies which build their devices on top of chips like the ESP32 are providing a lowcode editor based on Blockly, such as M5Stack which is very popular in China
  6. That's mainly the strategy of AWS, which is re-using Lambda for other services such as Glue ETL for data scientists for example, but there's also something for IoT makers who want to trigger some jobs with MQTT events, and multiple other examples...
  7. It can run on a Raspberry Pi just as it can hyperscale on Kubernetes clusters using knative or keda or any other CaaS infrastructure. I plan to deep dive into the architecture of our FaaS, but it'll be for another blogpost ;-p

· 7 min read
Idriss Neumann

In this blogpost, I won't describe in detail what Pulumi does. I've already talked about it many times in previous blogposts but also at IT conferences such as DevoxxFR:

devoxxfr-pulumi-university

Yeah I know it's in French, I'm sorry for non-French speakers [1]. Let me give you a bit of context: in this talk, we presented how Pulumi allows people to use their favorite programming language to do some IaC, and also how we can use it to turn this IaC into a real product with its own API and CLI. We called it Deployment as a Service or DaaS.

And that's why we're using it in our driver system [2] for cwcloud:

daas-classical-iaas

If you want to learn more about it, we also detailed the DaaS concept in this tutorial.

Now that we've said this tool is more suitable for people who want to deliver their IaC as a Service, I'll also try to explain my point of view on why this tool is better for almost everyone, including people who enjoy using Terraform or IaC with declarative languages such as HCL (HashiCorp Configuration Language).

First of all, I think a declarative language such as HCL is kind of a bad compromise for people who ain't working the same way:

  • classic system administrators who want to only configure and avoid implementing any kind of logic
  • SREs (Site Reliability Engineers) or Platform Engineers who want to use a Turing-complete programming language and be able to implement business logic in their IaC

The usual way to solve this is to use a configuration language such as YAML, which is easy to read but can also be templated using an engine such as Jinja2 (used by Ansible) or Go templates (used by helm).

However, HashiCorp tried to reunite both needs with a single language for all its products, including Terraform, and it led to something nobody likes very much, neither the developers nor the system administrators, if you want my honest opinion [3].

Here's an example of a service we'd like to enable or disable with an enable_my_service flag, while also managing high availability [4] with another high_availability flag:

resource "aws_instance" "my_service" {
  count         = (var.enable_my_service == true ? (var.high_availability == true ? 3 : 1) : 0)
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.subnet_public.id
  tags          = merge(local.common_tags)
}

You find this ugly? Wait for this: sometimes when the terraform provider isn't supporting an endpoint from the IaaS API, you have to use an "external" datasource.

A few years ago, that's something I had to do in order to get a PSC (Private Service Connect) [5] id from GCP (Google Cloud Platform) and inject it into the elastic provider to make the connection:

data "external" "get_psc_id" {
  program = ["bash", "${path.module}/get_psc_id.sh", var.region, var.gcp_host_project, google_compute_forwarding_rule.psc_consumer[0].name]
  count   = (var.enable_psc == true ? 1 : 0)
}

resource "ec_deployment_traffic_filter" "traffic_filter" {
  name   = "${var.stage}-${var.project}-${var.region}-filter"
  region = "${var.region_prefix}-${var.region}"
  type   = "gcp_private_service_connect_endpoint"
  count  = (var.enable_psc == true ? 1 : 0)

  rule {
    source = data.external.get_psc_id[0].result.pscConnectionId
  }
}

And the program invoked by the external datasource has to be written in another programming language, such as bash in this example:

#!/usr/bin/env bash

# Print the PSC connection id and IP address of a forwarding rule as JSON,
# which is the output format expected by the Terraform "external" datasource.
set -eu

region="$1"
projet="$2"
name="$3"

jsonOutput="$(gcloud --project="$projet" compute forwarding-rules describe "$name" --format json --region "$region")"
pscId="$(echo "$jsonOutput"|jq -r .pscConnectionId)"
pscIp="$(echo "$jsonOutput"|jq -r .IPAddress)"

echo "{\"pscConnectionId\": \"${pscId}\", \"IPAddress\": \"${pscIp}\"}"

Using Pulumi, I'd be able to parse the output of gcloud directly in my Python or Go code.
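Here is a sketch of how both HCL examples above could look in Python. The nested ternary becomes a plain function, and the gcloud call is parsed in the same program instead of going through an external script. The function names are hypothetical; the Pulumi part assumes the pulumi_aws package, which is imported lazily so the plain helpers can be read and tested on their own.

```python
import json
import subprocess


def instance_count(enable_my_service, high_availability):
    """Same rule as the HCL count expression: 0 if disabled, 3 if HA, else 1."""
    if not enable_my_service:
        return 0
    return 3 if high_availability else 1


def create_instances(ami_id, subnet_id, enable_my_service, high_availability):
    """Declare the instances in a loop; only meaningful inside a Pulumi program."""
    import pulumi_aws as aws  # lazy import: requires the pulumi_aws package

    return [
        aws.ec2.Instance(
            f"my-service-{i}",
            ami=ami_id,
            instance_type="t2.micro",
            subnet_id=subnet_id,
        )
        for i in range(instance_count(enable_my_service, high_availability))
    ]


def parse_forwarding_rule(json_text):
    """Keep only the two fields the bash script was extracting with jq."""
    rule = json.loads(json_text)
    return {
        "pscConnectionId": rule["pscConnectionId"],
        "IPAddress": rule["IPAddress"],
    }


def describe_forwarding_rule(project, region, name):
    """Run the same gcloud command as the bash script and parse its output."""
    output = subprocess.run(
        ["gcloud", "--project", project, "compute", "forwarding-rules",
         "describe", name, "--format", "json", "--region", region],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_forwarding_rule(output)
```

The point isn't that this is shorter, but that the logic lives in ordinary functions you can unit test, without a second language and a JSON contract between the two.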

Having a tool reuniting multiple needs might be a good thing, but in this case I prefer a tool that provides several languages. And that's exactly what Pulumi offers.

Indeed you can use Pulumi with your favorite programming language but there's also a simple declarative YAML interface available. Here's what it looks like (example from the pulumi blogpost):

name: yamldemo
runtime: yaml
resources:
  bucket:
    type: aws:s3:Bucket
    properties:
      website:
        indexDocument: index.html
  index.html:
    type: aws:s3:BucketObject
    properties:
      bucket: ${bucket.id}
      content: <h1>Hello, world!</h1>
      contentType: text/html
      acl: public-read
outputs:
  url: http://${bucket.websiteEndpoint}

In my opinion it's a better approach to continue to answer everyone's needs: choose the language you like to work with, including the best-known declarative language used for configuration, which is YAML nowadays. And even if you don't like YAML, it's pretty easy to produce a YAML file from another format (way easier than producing code from templates) ;p

That being said, it's hard to catch up after years or decades of cloud players or vendors interfacing their IaaS (Infrastructure as a Service) or SaaS (Software as a Service) with public Terraform providers available on public registries. However, at some point it happens: we can remember the hegemony of Puppet for years before Ansible.

In the case of Pulumi, despite the fact they've already done the job for most of the big cloud players, they also made very smart moves like:

  • providing tools to convert a Terraform provider into a Pulumi SDK in every supported language
  • letting you include Terraform as a dependency directly in your Pulumi code
  • letting you convert the state of resources previously created with Terraform into Pulumi states

This makes it so easy to mitigate until all the providers understand that Pulumi is the way!

devoxxfr-pulumi-university

Pulumi isn't only an alternative to Terraform in the IaC world of classical IaaS resources, it's also shaking up the Kubernetes (K8S) world by being the first solid alternative to helm.

Although I like helm very much and Go templates don't bother me at all, I must admit that it's hated by a lot of people who prefer to use kustomize despite the duplication it generates, or who even prefer to handle K8S manifests themselves with some piece of code...

That's why Pulumi is providing a Kubernetes package, and I think it could be the right call for those people because they'll be able to implement deployment logic using the programming language they like instead of a templating engine they despise.

And what is beautiful is that you can also include and re-use public helm charts coming from public registries, exactly the same way you can re-use Terraform providers inside your Pulumi code.

Here's an example of invoking with values the nginx-ingress helmchart inside a Pulumi Python's code (example from the official documentation):

from pulumi_kubernetes.helm.v3 import Chart, ChartOpts, FetchOpts

nginx_ingress = Chart(
    "nginx-ingress",
    ChartOpts(
        chart="nginx-ingress",
        version="1.24.4",
        fetch_opts=FetchOpts(
            repo="https://charts.helm.sh/stable",
        ),
        values={
            "controller": {
                "metrics": {
                    "enabled": True,
                },
            },
        },
    ),
)

Amazing isn't it?

To conclude, we can see how Pulumi smartly meets a large number of needs in the IaC world: people who prefer to configure, those who prefer to develop, people working with K8S or classical IaaS resources...

Like I said multiple times in my previous blogposts for PaaS (Platform as a Service) or frameworks and it includes IaC tools as well: interoperability, agnosticity and polyglotism are keys to success.


  1. I plan to have this talk in future english speaking events, maybe a Pulumi's official meetup, stay tuned :p
  2. At the time of writing, we have managed to develop a driver, using the available pulumi modules, for the following cloud providers: AWS, GCP, Azure, Scaleway, OVH and Cloudflare
  3. Of course you'll find people who say they like Terraform and HCL, but keep in mind that Terraform is 10 years old now and challengers like Pulumi or Crossplane aren't that old. So it makes sense that lots of them could have developed some kind of "digital Stockholm Syndrome" because they have managed to use it for years
  4. Let's assume High Availability means deploying three nodes of this service
  5. It's a way to establish private connections between your VPC and external services hosted somewhere else like elastic cloud in this example. More information here.

· 6 min read
Idriss Neumann

In this blog post, I'll try to explain why we moved from ElasticStack to Quickwit and Grafana and why we chose it over other solutions.

First, we've been in the observability world for quite some time and have been using ElasticStack for years. I personally used Elasticsearch for more than 10 years, and Apache Solr before that for logging and observability use cases, even before Elasticsearch's birth!

We also managed to use ElasticStack for IoT (Internet of Things) projects and rebuilt our own images of Kibana and Elasticsearch for ARM32 and ARM64 before Elastic (the company) started to release official images. We had a lot of fun with it.

rpi-elastic

However, everyone who works with it on premises knows that Elastic is a big distributed system which brings everyone a lot of struggles such as:

  • Log retention, because data is stored on the filesystem and disk storage is expensive [1]
  • Like most highly distributed databases developed in Java, it has a very high footprint, consumes a lot of RAM...
  • You also get issues such as "split brains" when you're dealing with HA (High Availability)

On the other hand, there are SaaS (Software as a Service) observability solutions such as Datadog or Elastic Cloud which save you the trouble of managing clusters but which are very expensive. And even putting the price aside, most of our customers are required to keep all the data on an infrastructure they own.

That being said, Grafana proposed an alternative called Grafana Loki which stores the data on object storage. The idea of using object storage is great because it often implements HA by design on most of the big cloud players and it lowers the price a lot. Moreover, even when you're on premises, you often want to ensure the HA of only a few components, object storage among them.

However, we weren't convinced because Loki doesn't implement a real search engine such as Apache Lucene, used by both Elasticsearch and Solr. It also appears to be very slow, with bad feedback from the community such as this one.

So we were looking for a solution which combines the advantages of both worlds: an efficient search engine which compensates for the slowness brought by the use of the object storage's API.

And then we discovered Quickwit \o/.

quickwit-gui

Quickwit is built on top of Tantivy, which is similar to Lucene but written in Rust [2], and also stores the indexed data on object storage. That's the main reason making Quickwit better than Loki [3] and Elasticsearch in my opinion.

Quickwit also brings lots of integrations with the CNCF ecosystem [4]:

  • A datasource for Grafana
  • OpenTelemetry interoperability for traces and logs ingestion
  • Jaeger's gRPC API interoperability, which allows us to use Quickwit as a storage backend for traces and keep the Jaeger UI or the Jaeger datasource in Grafana. To our knowledge, this is the only solution to store Jaeger traces on object storage
  • Elasticsearch or OpenSearch [5] API interoperability
  • Falcosidekick, which can use Quickwit as an output
  • Glasskube, which makes Quickwit's installation on Kubernetes easier [6]

quickwit-gui

That's why we decided to propose Quickwit as our main observability solution in the cwcloud DaaS (Deployment as a Service) platform. You can check out this tutorial to get more information.

quickwit-cwcloud

Moreover, we also started to migrate most of our customers' infrastructures to Quickwit instances, and we recommend designing new applications with the OpenTelemetry SDK available in their stack when possible, or using Vector from Datadog, which brings a lot of advantages as well:

  • It's very fast and has a very low footprint compared to some other well-known solutions such as Fluent Bit, Logstash and even Filebeat from ElasticStack (probably because it's written in Rust :p ).
  • It provides a very powerful language, VRL (Vector Remap Language), to remap your logs and make them compliant with already existing index mappings [7].
  • It works with Kubernetes but also with docker and even with logs written on the filesystem by legacy applications. This is very convenient for us because, as explained in my previous blog post Docker in production, is it really bad?, we have a lot of customers who are using docker in production (through cwcloud's DaaS) instead of Kubernetes.
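To show what such a remap amounts to, here is the idea sketched in Python rather than VRL: take a docker log event and reshape it into the fields an existing index expects. The input keys (message, container_name, stream, timestamp) follow what Vector's docker_logs source usually emits, while the target field names and the stderr-to-ERROR rule are assumptions made for the example, not the actual remap from our tutorial.

```python
from datetime import datetime


def remap_docker_log(event):
    """Reshape a docker log event into an OpenTelemetry-like log document."""
    ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
    return {
        # Nanoseconds since epoch, built from whole seconds plus microseconds
        "timestamp_nanos": int(ts.timestamp()) * 1_000_000_000 + ts.microsecond * 1000,
        "service_name": event.get("container_name", "unknown"),
        # Naive heuristic (assumption): anything written on stderr is an error
        "severity_text": "ERROR" if event.get("stream") == "stderr" else "INFO",
        "body": {"message": event["message"]},
    }


# Hypothetical event, as it could come out of a docker log source
doc = remap_docker_log({
    "timestamp": "2024-01-01T00:00:00Z",
    "container_name": "my-api",
    "stream": "stderr",
    "message": "connection refused",
})
print(doc)
```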

For most of them, as for our own internal use, we have divided the compute consumption by at least 3 while increasing the retention. Larger companies have successfully built astronomically large logging services with Quickwit, such as Binance with 100PB of stored data.

So now Quickwit is covering our observability needs in terms of logs and traces but we still miss the metrics. For the metrics use case we're using VictoriaMetrics, which is working pretty well but lacks object storage support. We know that Quickwit plans to handle this use case one day with a real TSDB (Time Series Database), which sounds really promising. I'm quite convinced that separating compute from storage and offering object storage is now a key success factor for building modern observability solutions.

To conclude, I still think ElasticStack is a great product with a bigger company behind it, providing more advanced features including AI (Artificial Intelligence) capabilities. I might still offer it to some customers who are interested in some of those features, or who use Elasticsearch as a full-text search engine on which some applications or microservices depend (Quickwit isn't the best choice in this case, it's more suitable for observability use cases only).


  1. We know that Elasticsearch provides object storage compatibility with the searchable snapshots feature, but on one hand it's not available in the open source version, and on the other hand it's only recommended for cold data which is not supposed to be fetched too often.
  2. Tantivy is 2x faster than Lucene according to this benchmark, which compensates for the slowness brought by the use of object storage.
  3. Quickwit also provides this benchmark with Loki, trying to make a fair comparison.
  4. I'm myself involved in contributing to a lot of them, commissioned by Quickwit Inc. (the company).
  5. OpenSearch is a fork of ElasticStack initiated by Amazon AWS.
  6. I wrote a blog post directly on Quickwit's blog if you want more information.
  7. You can see an example of a remap function that makes docker logs compliant with the default otel-logs-v0_7 index in this tutorial.

· 8 min read
Idriss Neumann

Since the rise of Kubernetes (or K8S) and the OCI (Open Container Initiative), which standardizes containerization on Linux, we read more and more often that using docker [1] as a runtime on production infrastructures is becoming a poor choice.

In this blogpost, I'll focus on answering the criticisms that come from people in the containerization world who are mainly convinced that K8S is the only viable way to deploy to production. There are also criticisms that come from people who are opposed to the principle of containerization itself. I'll probably answer those another day.

In my previous blogpost Kubernetes or not, that's the question, I already detailed how K8S and its ecosystem lower the deployment complexity by taking care of many things by design (autoscaling, reconciliation loops, easier observability...) and by being the most standard IaaS (Infrastructure as a Service) API specification, available everywhere (on premises, on almost every cloud provider...). It sounds like the perfect fit to set up a real deployment platform that makes deployment very easy and seamless everywhere. By adding some tools like teleport or knative, it might become a real PaaS (Platform as a Service), and the SREs (Site Reliability Engineers) operating those clusters can be seen as Platform Engineers.

So I'm not trying to convince anybody to avoid going with K8S, especially when it's a matter of building a new modern platform at company scale or providing a multitenant service. I'm pretty convinced myself it's probably the best choice nowadays. That's also why we are providing a K8S version of our DaaS [2] (Deployment as a Service) solution.

That being said, if we take a few steps back, we can see multiple advantages to using docker, and especially docker compose, in 2024.

On one hand, lots of businesses, regardless of their size, already have applications running on virtual servers or compute engines. A first step might be to containerize their applications and switch from a process orchestrator such as systemd or pm2 to an OCI runtime like docker or containerd. Once all apps are containerized, the lift and shift to other infrastructures such as K8S, but also CaaS (Container as a Service) like ECS on AWS or Cloud Run on GCP, will be easier. It's basically a "Divide and Conquer" strategy. In my experience, telling those people from the beginning not to use a container runtime on their existing machines might discourage them from starting a migration despite the benefits.

On the other hand, the compose syntax can also be seen as another standard API specification in my opinion, like the K8S one. It just doesn't handle as many things as K8S. However, it might be sufficient for lots of customers and it's by far better known by most developers.

A few years ago, during a DevoxxFR event, I heard someone say:

Docker was designed by developers to let them deploy their apps in production; K8S is the answer of sysadmins trying to take back control of production.

It was completely true, and now the new generation of sysadmins who want to keep control are called SREs. It's not quite the same mindset as that of Platform Engineers, who want to give control to the feature teams. So maybe Platform Engineers should provide a standard API that is easier to use, and for me compose is a really good candidate.

Moreover, this idea isn't new. That's why kompose has existed for several years, and Docker, Inc. (the company) is now working on an experimental compose bridge project [3]. Docker, Inc. has also been enriching the compose specification for years, covering lots of production requirements such as healthchecks. So, in my opinion, this specification is far from being a local tool for developers only.
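To make the healthcheck point more concrete, here's a minimal sketch of a compose file using production-oriented features of the specification. The service names, the image of the `api` service, the port and the `/health` endpoint are hypothetical, and the `api` healthcheck assumes that `curl` is available inside the image:

```yaml
# Hypothetical compose file: names, images, ports and thresholds are illustrative only.
services:
  api:
    image: "registry.example.com/my-api:1.0.0"  # assumed image name
    restart: unless-stopped                      # restart the container if it crashes
    ports:
      - "8080:8080"
    healthcheck:
      # Assumes curl exists in the image and the app exposes /health
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    depends_on:
      db:
        condition: service_healthy               # wait until the database is healthy
  db:
    image: "postgres:16"
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: "change-me"             # placeholder, use a secret in real life
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
```

Restart policies, healthchecks and health-based startup ordering are the kind of requirements that make the specification usable beyond a developer's laptop.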

Polyglotism in deployment APIs is clearly a success factor for a PaaS (built on top of K8S or not), in my opinion: the more well-known deployment APIs it provides, the more it meets everyone's needs. In exactly the same way, the more programming languages and developer experiences a FaaS (Function as a Service) provides, the more it meets everyone's needs.

That being said, you might say:

Okay, using the compose specification on K8S is fine. But you were talking about using the docker engine on virtual machines, and that still doesn't bring us the PaaS, CaaS or FaaS platform we expect, unlike K8S.

That sounds true, because using docker on virtual machines requires configuring and securing each virtual machine with advanced system administration knowledge: configuring firewall rules (using iptables, ufw, firewalld, whatever), configuring a reverse proxy/load balancer in front of docker, configuring system users and their privileges, enforcing SSH connection policies... Of course, the docker runtime can take care of the resiliency of a single process, like systemd does, but all the rest remains.

Indeed, if you want to stay "modern" (auditable, gitops, using Infrastructure as Code, being able to roll back a change with a git revert, etc.), you'll have to use terraform/opentofu/pulumi/whatever to provision the infrastructure and set up ansible [4] to configure the virtual machines... and that's too much work compared to using a managed K8S cluster with helm charts and gitops tools like ArgoCD or FluxCD.

However, this work can be optimized with a DaaS platform such as cwcloud, in exactly the same way that you pool your helm charts and use umbrella charts to install a tenant of your application and its dependencies. We provide a tool where you can templatize your "environments" (or deployments) using a pretty easy GUI or CLI. Here's an example of a templatized Wordpress installation:

cwcloud-env-wordpress-1

cwcloud-env-wordpress-2

Once you've written your set of ansible roles, the injected variables and the documentation template, it takes only one API or CLI call (or even a single click in the GUI) to instantiate a virtual machine and perform the complete installation, along with a git repository that contains all the ansible configuration and triggers update pipelines whenever something changes (in a modern gitops approach).

From a developer's perspective, they just have to provide templates of their compose files inside an ansible role and reuse the other roles already developed and maintained by your platform engineering team. It starts to look like the way platform engineers work when they build their platform on top of K8S, right?
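As an illustration of what such a role could contain, here's a minimal sketch of an ansible task file that renders a compose template and starts it. This is not cwcloud's actual role layout: the role name, the paths and the `app_name` variable are hypothetical, and it assumes the `community.docker` collection is installed on the control node:

```yaml
# roles/my_app/tasks/main.yml (hypothetical role; paths and variables are assumptions)
- name: Create the application directory
  ansible.builtin.file:
    path: "/opt/{{ app_name }}"
    state: directory
    mode: "0750"

- name: Render the compose file from the developer's template
  ansible.builtin.template:
    src: docker-compose.yml.j2            # template shipped inside the role
    dest: "/opt/{{ app_name }}/docker-compose.yml"
    mode: "0640"

- name: Start or update the compose project
  community.docker.docker_compose_v2:
    project_src: "/opt/{{ app_name }}"    # directory containing the rendered file
    state: present
```

The developer only owns the template and its variables, while firewalling, reverse proxy and user management stay in the shared roles maintained by the platform team.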

Okay, that's really promising, but it's still not as seamless as a CaaS where the developer can also access the pods... like we're doing with teleport on top of K8S, or with a CaaS based on knative such as Cloud Run.

There's an underrated quick win to achieve this very easily: portainer. All it takes to get a modern platform with a nice GUI to manage all the containers on your virtual machines is a lightweight agent running on them.
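To give an idea of how lightweight this agent is, here's a minimal compose sketch based on my understanding of how portainer documents the standalone agent deployment. It isn't necessarily the exact configuration we ship with cwcloud, so check the official portainer documentation for the image tag and options that match your version:

```yaml
# Hypothetical compose file for the portainer agent on a docker host.
services:
  portainer_agent:
    image: "portainer/agent:latest"   # pin a version in production
    restart: always
    ports:
      - "9001:9001"                   # port the portainer server connects to
    volumes:
      # The agent needs the docker socket to manage the containers of the host
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
```

Since the agent has access to the docker socket, the exposed port should only be reachable from the portainer server, for example through the firewall rules configured by your ansible roles.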

portainer-containers.png

portainer-shell.png

That's why we offer it with cwcloud to some of our customers. You can watch this demo to understand how you can easily transform your infrastructure built on top of virtual machines and docker into a real CaaS platform using this combo [5]:

portainer_agent_demo

Portainer also works with K8S, which makes the lift and shift approach a lot easier.

To conclude, we like working with everyone and answering their needs, and we also like K8S very much (I've already said it multiple times). Some of our customers use K8S, and some of them are perfectly fine with compute instances running a docker runtime. For example, we have customers with multitenant applications who want to bill their own users for their cloud usage. It's more convenient this way because each customer pays for its own compute instances, instead of doing complicated FinOps on shared K8S clusters [6]. We also have customers who require the data and network of their different tenants to be segregated.

So yes, it's still fine in 2024 to work with docker in production; you just have to find a way to align with the state of the art and modern cloud and DevOps practices :)


  1. In this blog post, I only refer to the docker engine, which is open source, and not to Docker Desktop, which isn't and which manages many other things to help developers (Linux virtualization using QEMU to help handle interoperability between microprocessor architectures...).
  2. You can check out this tutorial to understand how DaaS works with cwcloud and what the difference is between IaaS, PaaS and DaaS.
  3. This is pretty promising and, unlike kompose, you can develop your own mapping rules to convert your compose files into K8S manifests with the shape you want (it's kind of like using helm to read the compose file as a values file, if you want my opinion, with many helpers that make it easier). It was presented by Guillaume Lours and Nicolas De Loof from Docker, Inc. at the last DevoxxFR 🇫🇷.
  4. I only mention ansible because I consider that it won the battle over puppet, chef, salt... a long time ago for most of the remaining infrastructures based on virtual machines ;)
  5. Since this demo, which is two years old, our design and portainer's design have improved a lot, but it still gives an idea of how easy it is to get a real CaaS platform on top of our DaaS.
  6. Yes, we could use K8S with tools like kubecost instead. However, it's easier for them to see their customers' names directly associated with the compute in the final cloud bills.

· 9 min read
Idriss Neumann

To cap off years of debate on whether Kubernetes (or K8S) is a good fit for everyone or not, I'll finally share my deep feelings, which took me a while to form after years of use.

In my case, I really like K8S, but I don't have any particular problem working with "traditional" infrastructures built on top of VMs [1] (Virtual Machines), especially because it sometimes benefits some of our customers.

Why is this debate still happening in 2024?

First, I'll try to understand why this debate is still happening in 2024, and it's pretty simple. If I had to give developers an analogy, using K8S rather than a classic IaaS (Infrastructure as a Service) that provides VMs on demand is comparable to working with a very high-level framework like Spring Boot rather than with the C programming language.

So we should have expected this debate, which is exactly like the one between developers who consider that framework users are losing their skills and becoming proletarians, and framework users who observe that they get a better business velocity. It's exactly the same debate, and it will keep running for years in exactly the same way.

Understanding the anti-K8S point of view

In Unix/Linux, you already have everything you need to automate resilient infrastructures: command/shell interpreters and scripting languages, schedulers (cron, anacron...). When you take these elements one by one, lots of people will say to themselves "there's nothing complicated, it's very easy to use" or, for those who have been using them for years, "why change a winning team?", and that's an understandable point of view.

For people who have built their business for years on those technologies, which aren't outdated at all, there's no return on investment in telling them to change, if we're honest for a few minutes. If we take the time to think and put ourselves in their shoes, we should realize that we're asking them to do work that brings no added value, because they already have a high satisfaction rate and spend minimal time intervening when issues occur.

Some will even find flaws [2] that can be legitimately debated in the current implementation of K8S, as in every technology, including the Linux kernel. This is particularly true of some cloud players that have implemented their own resilient control-plane model, which has been working for years and hasn't been fully amortized yet. It's understandable that they're trying to sell their product, which isn't necessarily badly done and can still answer lots of use cases on the market very well. However, everyone should take a few steps back from the argument, because no solution is perfect. In the end, what matters is the pros and cons and the tradeoffs we choose to make as decision makers in order to get the best ROI (return on investment).

Understanding the pro-K8S point of view

Now there's a new generation of sysadmins that we sometimes call "SREs" (which stands for Site Reliability Engineers) or even Platform Engineers, because we ask them to provide their skills and work routines "as a Service", with a time to market as competitive as that of cloud players who have been building their IaaS or PaaS (Platform as a Service) for decades. I'll try to explain here why K8S seems to be the perfect fit for those people.

Let's continue with the example of cron jobs on Linux. cron and anacron are great, very well known and have been running for years. But if we ask those people to schedule some tasks with cron jobs and make them auditable, resilient in case of shutdown and even highly available, those well-known tools will require them to add structured logs with a monitoring system, implement exponential retries in case of failure, install the crontabs on multiple servers and handle concurrency with semaphores/locks, reconciliation loops...

That's starting to be a lot to handle for something that seemed simple, right? And this is only one of the easiest examples of what you have to handle when you're in charge of the reliability of your systems. It can also matter for everyone nowadays, including small businesses. It'd be a big mistake and bad judgement to think that this is too fancy a consideration for small businesses, especially when they can now handle it for a few dollars per month.

And that's the thing with K8S: it's already implemented by design, without any real effort: only 5 minutes of work with a single CLI or API invocation. And this is only the simplest example I found, but it's exactly the same for every aspect of deployment automation that we've come to see as a commodity over the years.
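To illustrate the cron example, here's a minimal sketch of a K8S CronJob manifest covering most of the requirements listed above. The job name, image and command are hypothetical; the fields themselves come from the standard `batch/v1` API:

```yaml
# Hypothetical CronJob: name, image and command are illustrative only.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"            # same syntax as a crontab entry
  concurrencyPolicy: Forbid        # never run two instances at the same time
  startingDeadlineSeconds: 300     # still start a run missed by up to 5 minutes
  successfulJobsHistoryLimit: 3    # keep a short history for auditing
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 4              # retries, with an increasing back-off delay
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: "registry.example.com/report:1.0.0"  # assumed image name
              command: ["/app/generate-report"]           # assumed entrypoint
```

Applying it with `kubectl apply -f cronjob.yaml` is the single CLI invocation: the scheduler then picks any available node, so the job no longer depends on one specific server being up.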

What I'm trying to explain here is that seeing K8S as a distributed orchestrator for large businesses, which only brings value when there are lots of autoscaling requirements, is the wrong perspective. K8S should in fact be seen as the new generation of IaaS, with a standard API that takes care of every commodity we may need regardless of the size of our business, that works almost everywhere and that avoids vendor lock-in.

That being said, just because it's available almost everywhere doesn't mean that the pricing model, or the cost of moving what is already running, brings the best ROI for everyone. For my part, I deeply regret that there's still no serverless implementation of the K8S API at any cloud provider, and that this API is still strongly coupled to a single codebase (with some flaws, as we said before). The "standard API specification existing everywhere" is only theoretical.

My personal point of view

You might know that we're building our DaaS (Deployment as a Service) and FaaS (Function as a Service) solution, cwcloud, because we strongly believe those are the best compromises between IaaS and PaaS. You can check out this tutorial to get a better understanding.

We insist on continuing to provide those services in an agnostic way on both K8S and classical IaaS [3], and we don't complain about it, because we consider that it's our job to adapt and try to make most people happy: that's what brings us more business.

However, we have to admit that the DaaS implementation is far less complex with K8S, and it's noticeable just by looking at these two diagrams, which present the architecture of the two implementations:

Without K8S

daas-classical-iaas

With K8S

daas-k8s

You can easily see here which version took us more effort ;-)

So, as a developer, I also strongly believe that the Kubernetes API lowers the complexity of deployments, in exactly the same way that a framework and runtime like Spring Boot takes care of the commodities we used to develop ourselves to expose our code as a microservice (HTTP exposure of our business logic, abstraction, logs, metrics...).

As a developer, I also love coding in the C programming language, as I love understanding low-level stuff in Linux/Unix operating systems. I feel more powerful and more competent with it. However, as a manager, I'd be a fool to avoid frameworks like Spring Boot that increase productivity. Analyzing the average velocity and ROI is also part of the engineering process.

That being said, I'm not saying that we should always pick the K8S option. Not at all: I've already explained that, on one hand, it brings no added value for lots of skillful people or already established infrastructures and that, on the other hand, the K8S pricing offered by most cloud providers still isn't great [4]. We're still missing serverless offers where billing is based only on our pods' consumption, which would make people believe that K8S isn't an orchestrator but a standard API definition for deploying everywhere in a fairly agnostic way.

Honestly, we have customers using cwcloud without K8S and it works just as well, sometimes with a better pricing model depending on the chosen cloud provider. As always, we have to analyze the pros and cons for each of them and help them accept some tradeoffs in order to bring the best possible business value.

And that's also the thing for companies looking for a high-level and simple way to deploy their stack (PaaS, FaaS or DaaS): it's okay whether it runs on top of K8S or not. The only things that should matter to you are whether the SLA (Service Level Agreement) and your own velocity are good and whether your provider is reliable. Let them sell you their features, not how they achieved them. And it's also okay to change your mind in the future and rebuild everything when you have more funds and more needs...


  1. I know that K8S isn't tied to containers anymore and can orchestrate VMs (with kubevirt, for example) or WASM binaries. Here I'm referring to the classical IaaS or hypervisors we used almost everywhere before the rise of K8S, and which are still widely used.
  2. For example, K8S relies on etcd, which is a stateful component and which cannot be "highly available" by design. But some distributions like K3S offer the ability to replace it with something more reliable like NATS (and some cloud players have made their own rewrite of etcd).
  3. We are compliant with the IaaS APIs of AWS, GCP, Azure, Scaleway and OVH, and with Openstack for on-premises infrastructures. I'm referring to those when I talk about "classical IaaS", which provides storage and compute as VMs.
  4. At the time of writing, Scaleway offers a free shared control plane, which is pretty promising (it was unstable but is getting better, in my opinion). This way, you only pay for your nodes at the same price as any compute instance (and their pricing is very competitive: you can have a fully functional cluster for less than 40 dollars per month). But this kind of deal is not very common among the biggest cloud players. Anyway, Scaleway is becoming a really great deal, if you want my honest opinion :)