An adventure in trying to optimize math.Atan2 with Go assembly

This is an account of an adventure in which I experimented with Go assembly while trying to optimize the math.Atan2 function. :smile:

Some context

I wanted to optimize the math.Atan2 function because my current work involves some math calculations, and the math.Atan2 call was in the hot path. Usually I don’t try to improve on what the standard library is already doing, but just for the heck of it, I looked into whether the calculation could be done any faster.

And that led me to this SO link. It turns out there is an FMA operation which does a fused multiply-add, computing x*y + z in a single instruction with a single rounding step. That was very interesting. Looking into Go, I found that there is an open issue for this which is yet to be implemented in the Go assembler. That means the Go code is still doing a normal multiply followed by an add inside the math.Atan2 call. This seemed like something that could be optimized. At least, it was worth a shot to see if there were considerable gains.

But trying it out meant I had to write an assembly module to be called from Go code.

So it begins …

I started to do some digging. The Go documentation mentions how to add unsupported instructions in a Go assembly module. Essentially, you have to write the opcode for that instruction using a BYTE or WORD directive.

I wanted to start off with something simple. Found a couple of good links here and here. The details of how an assembly module works are not necessary to mention here. The first link explains it pretty well. This will be just about how the FMA instruction was utilized to replace a normal multiply-add.

Anyway, so I copied the simple addition example and got it working. Here is the code for reference -

#include "textflag.h"

TEXT ·add(SB),NOSPLIT,$0
	MOVQ x+0(FP), BX
	MOVQ y+8(FP), BP
	ADDQ BP, BX
	MOVQ BX, ret+16(FP)
	RET

Note the #include directive. You need that. Otherwise, the assembler does not recognize the NOSPLIT flag.

Now, the next target was to convert this to add float64 variables. Keep in mind, I am an average programmer whose last brush with assembly was at university in some sketchy course. The following might be simple to some of you, but it was not to me.

After some trial and error and sifting through some Go code, I got to a working version. Note that this adds 3 variables instead of 2. This was to prepare the example for the FMA instruction.

TEXT ·add(SB),$0
	FMOVD x+0(FP), F0
	FMOVD F0, F1
	FMOVD y+8(FP), F0
	FADDD F1, F0
	FMOVD F0, F1
	FMOVD z+16(FP), F0
	FADDD F1, F0
	FMOVD F0, ret+24(FP)
	RET

Then I had a brilliant (totally IMO) idea. I could write a simple floating-point add in Go, run go tool compile -S, take the generated assembly and copy that instead of hand-coding it myself ! This was the result -

TEXT ·add(SB),$0
	MOVSD x+0(FP), X0
	MOVSD y+8(FP), X1
	ADDSD X1, X0
	MOVSD z+16(FP), X1
	ADDSD X1, X0
	MOVSD X0, ret+24(FP)
	RET

Alright, so far so good. Only thing remaining was to add the FMA instruction. Instead of adding the 3 numbers, we just need to multiply the first 2 and add it to the 3rd and return it.

Looking into the documentation, I found that there are several variants of FMA. Essentially there are 2 main categories, which deal with single-precision and double-precision values. And each category has 3 variants which differ in which operands are multiplied and which one is added. I went ahead with the double-precision one because that’s what we are dealing with here. These are the variants of it -

VFMADD132PD: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMADD213PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMADD231PD: Multiplies the two or four packed double-precision floating-point values from the second source to the two or four packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

The explanations are copied from the Intel reference manual (pg 1483). Basically, the 132, 213 and 231 denote the order of the operands on which the operations are done. Why there is no 123 is beyond me. :confused: I selected the 213 variant because that’s what felt intuitive to me - doing the addition with the last operand.

Ok, so now that the instruction was selected, I needed to get the opcode for it. Believe it or not, this was where everything came to a halt. The Intel reference manual and other sites all give the opcode as VEX.DDS.128.66.0F38.W1 A8 /r and I had no clue what that was supposed to mean. The Go doc link showed that the opcode for EMMS was 0F, 77. So, maybe for VFMADD213PD, it was 0F, 38 ? That didn’t work. And no variation of that worked.

Finally, a breakthrough came with this link. I wrote a file containing this -

BITS 64

VFMADD213PD xmm0, xmm2, xmm3

Saved it as test.asm. Then after a yasm test.asm and an xxd test, I got the holy grail - C4E2E9A8C3. Like I said, I had no idea why it was so different from what the documentation said, but nevertheless decided to move ahead.

Alright, so integrating it within the code. I got this -

// func fma(x, y, z float64) float64
TEXT ·fma(SB),NOSPLIT,$0-32
	MOVSD x+0(FP), X0
	MOVSD y+8(FP), X2
	MOVSD z+16(FP), X3
	// VFMADD213PD X0, X2, X3
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xC3
	MOVSD X0, ret+24(FP)
	RET

Perfect. Now I just needed to write my own atan2 implementation with the multiply-add operations replaced with this asm call. I copied all of the code from the standard library for the atan2 function, and replaced the multiply-additions with an fma call. The brunt of the calculation actually happens inside an xatan call.

Originally, the xatan function does this -

z := x * x
z = z * ((((P0*z+P1)*z+P2)*z+P3)*z + P4) / (((((z+Q0)*z+Q1)*z+Q2)*z+Q3)*z + Q4)
z = x*z + x

Then replacing it with my function, this was what I got -

z := x * x
z = z * fma(fma(fma(fma(P0, z, P1), z, P2), z, P3), z, P4) / fma(fma(fma(fma((z+Q0), z, Q1), z, Q2), z, Q3), z, Q4)
z = fma(x,z,x)

Did some sanity checks to verify the correctness. Everything looked good. Now time to benchmark and get some sweet perf improvement !

And, here was what I saw -

go test -bench=. -benchmem
BenchmarkAtan2-4     100000000     23.6 ns/op     0 B/op   0 allocs/op
BenchmarkMyAtan2-4   30000000      53.4 ns/op     0 B/op   0 allocs/op
PASS
ok  	asm	4.051s

The fma implementation was slower, much slower than the normal multiply-add. To dig deeper, I benchmarked just the fma function against a native Go multiply-add. This was what I got -

go test -bench=. -benchmem
BenchmarkFMA-4                  1000000000    2.72 ns/op   0 B/op    0 allocs/op
BenchmarkNormalMultiplyAdd-4    2000000000    0.38 ns/op   0 B/op    0 allocs/op
PASS
ok  	asm	3.799s

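I knew it! The assembly call overhead was more than the gain I got from the fma calculation. The Go compiler cannot inline a function written in assembly, so every fma call pays for a real function call, while the native multiply-add is compiled straight into the loop. Just to confirm this theory, I did another benchmark comparing it with an assembly implementation of a normal multiply-add.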

go test -bench=. -benchmem -cpu=1
BenchmarkFma        1000000000      2.65 ns/op     0 B/op     0 allocs/op
BenchmarkAsmNormal  1000000000      2.66 ns/op     0 B/op     0 allocs/op
PASS
ok  	asm	5.866s

Clearly it was the function call overhead. That meant if I implemented the entire xatan function, which has 9 multiply-adds, in assembly, there was a chance that the gain from the fma instructions would outweigh the loss from the single assembly call. Time to put the theory to the test.

After a couple of hours of struggling, my full asm xatan implementation was complete. Note that there are 8 fma instructions. The last multiply-add can also be converted to fma, but I was too eager to find out the results. If it gave any benefit, then it would make sense to optimize further. This was my final xatan implementation in assembly.

// func myxatan(x float64) float64
TEXT ·myxatan(SB),NOSPLIT,$0-16
	MOVSD   x+0(FP), X2
	MOVUPS  X2, X1
	MULSD   X2, X2
	MOVSD   $-8.750608600031904122785e-01, X0
	MOVSD   $-1.615753718733365076637e+01, X3
	// VFMADD213PD X0, X2, X3
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xC3
	MOVSD   $-7.500855792314704667340e+01, X3
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xC3
	MOVSD   $-1.228866684490136173410e+02, X3
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xC3
	MOVSD   $-6.485021904942025371773e+01, X3
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xC3
	MULSD   X2, X0 // storing numerator in X0
	MOVSD   $+2.485846490142306297962e+01, X3
	ADDSD   X2, X3
	MOVSD   $+1.650270098316988542046e+02, X4
	// VFMADD213PD X3, X2, X4
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xDC
	MOVSD   $+4.328810604912902668951e+02, X4 // Q2
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xDC
	MOVSD   $+4.853903996359136964868e+02, X4 // Q3
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xDC
	MOVSD   $+1.945506571482613964425e+02, X4 // Q4
	BYTE $0xC4; BYTE $0xE2; BYTE $0xE9; BYTE $0xA8; BYTE $0xDC
	DIVSD   X3, X0
	MULSD   X1, X0
	ADDSD   X0, X1
	MOVSD   X1, ret+8(FP)
	RET

This was the benchmark code -

func BenchmarkMyAtan2(b *testing.B) {
	for n := 0; n < b.N; n++ {
		myatan2(-479, 123) // same code as standard library, with just the xatan function swapped to the one above
	}
}

func BenchmarkAtan2(b *testing.B) {
	for n := 0; n < b.N; n++ {
		math.Atan2(-479, 123)
	}
}

And results -

goos: linux
goarch: amd64
pkg: asm
BenchmarkMyAtan2-4    50000000    25.3 ns/op       0 B/op      0 allocs/op
BenchmarkAtan2-4      100000000   23.5 ns/op       0 B/op      0 allocs/op
PASS
ok  	asm	3.665s

Still slower, but much better this time. I had managed to bring it down from 53.4 ns/op to 25.3 ns/op. Note that these are just results from one run. Ideally, good benchmarks should be run several times and compared with the benchstat tool. But the point here is that even after writing the entire xatan code in assembly, with only one function call, it was only just comparable with the normal atan2 function. That is not good enough. Unless the gains are big, it doesn’t make sense to write and maintain an assembly module.

Maybe if someone implements the entire atan2 function in assembly, we might actually see the asm implementation beat the native one. But still I don’t think the gains will be great enough to warrant the cost of writing it in assembly. So until the time issue 8037 is resolved, we will have to make do with whatever we got.

And that’s it !

It was fun to tinker with assembly code. I have much more respect for a compiler now. Sadly, not all adventures end with a success story. Some adventures are just for the experience :wink:

How I landed my first contribution to Go

I have been writing open-source software in Go for quite some time now. And only recently, an opportunity came along, which allowed me to write Go code at work too. I happily shifted gears from being a free-time Go coder to full time coding in Go.

All was fine until the last GopherCon happened, where a contributor’s workshop was held. Suddenly, seeing all these people committing code to Go gave me an itch to do something. And within a few days, Francesc did a wonderful video on the steps to contribute to the Go project on his JustForFunc channel.

The urge was too much. With only an inkling of an idea on what to contribute, I decided to at least download the source code and compile it. Thus began my journey to become a Go contributor !

I started reading the contribution guide and followed along the steps. Signing the CLA was a bit of a struggle, because the instructions were slightly incorrect. Well, why not raise an issue and offer to fix it on my own ? That could well be my first CL ! Excited, I filed this issue. It turned out to be a classic n00b mistake. The issue was already fixed in tip, and I didn’t even bother to look. Shame !

Anyway, now that everything was set, I was wandering aimlessly through the standard library. After writing Go code continuously for a few months at work, I had noticed a few areas in the standard library which consistently came up as hotspots in the cpu profiles. One of them was the fmt package. I decided to look at the fmt package and see if something could be done. After an hour or so, something came out.

The fmt_sbx function in the fmt/format.go file starts like this -

func (f *fmt) fmt_sbx(s string, b []byte, digits string) {
	length := len(b)
	if b == nil {
		// No byte slice present. Assume string s should be encoded.
		length = len(s)
	}

It was clear that len() was called twice when b was nil, whereas if the first call was moved to an else branch of the if condition, only one of them would happen. It was an extremely tiny thing. But it was something. Eventually, I decided to send a CL just to see what others would say about it.

Within a few minutes of my pushing the CL, Ian gave a +2, and after that Avelino gave a +1. It was unbelievable !

And then things took a darker turn. Dave gave a -1 and Martin concurred. Martin actually took binary dumps of the code and found that there was no difference in the generated assembly at all. Dave had already suspected that the compiler was smart enough to perform such an optimization on its own, so overall the change was a net loss: the else branch hurt readability for no measurable gain in performance.

The CL had to be abandoned.

But I learnt a lot along the way, adding new tools like benchstat and benchcmp to my belt. Moreover, now I was comfortable with the whole process. So there was no harm in trying again. :sweat_smile:

A few days back, I found out that a plain string concatenation is a lot faster than calling fmt.Sprintf() on strings. I started searching for a victim, and it didn’t take much time. It was the archive/tar package. The formatPAXRecord function in archive/tar/strconv.go has some code like this -

size := len(k) + len(v) + padding
size += len(strconv.Itoa(size))
record := fmt.Sprintf("%d %s=%s\n", size, k, v)

On changing the last line to - record := fmt.Sprint(size) + " " + k + "=" + v + "\n", I saw pretty significant improvements -

name             old time/op    new time/op    delta
FormatPAXRecord     683ns ± 2%     457ns ± 1%  -33.05%  (p=0.000 n=10+10)

name             old alloc/op   new alloc/op   delta
FormatPAXRecord      112B ± 0%       64B ± 0%  -42.86%  (p=0.000 n=10+10)

name             old allocs/op  new allocs/op  delta
FormatPAXRecord      8.00 ± 0%      6.00 ± 0%  -25.00%  (p=0.000 n=10+10)

The rest, as they say, is history :stuck_out_tongue_closed_eyes:. This time, Joe reviewed it. And after some small improvements, it got merged ! Yay ! I was a Go contributor. From being an average open source contributor, I actually made a contribution to the Go programming language.

This is in no way the end for me. I am starting to grasp the language much better and will keep sending CLs as and when I find things to do. Full marks to the Go team for tirelessly managing such a complex project so beautifully.

P.S. For reference -

This is my first CL which was rejected: https://go-review.googlesource.com/c/54952/

And this is the second CL which got merged: https://go-review.googlesource.com/c/55210/

Running JS Promises in series

After having read the absolutely wonderful Exploring ES6, I wanted to use my newly acquired ES6 skills in a new project. And promises were always the crown jewel of esoteric topics to me (after monads of course :P).

Finally a new project came along, and I excitedly sat down to apply all my knowledge into practice. I started nice and easy, moved on to Promise.all() to load multiple promises in parallel, but then a use case cropped up, where I had to load promises in series. No sweat, just head over to SO, and look up the answer. Surely, I am not the only one here with this requirement. Sadly, most of the answers pointed to using async and other similar libraries. Nevertheless, I did get an answer which just used plain ES6 code to do that. Aww yiss ! Problemo solved.

I couldn’t declare the functions in an array like in the example, because I had a single function. I modified the code a bit to fit my use case. This was how it came out -

'use strict';
const load = require('request');

let myAsyncFuncs = [
  computeFn(1),
  computeFn(2),
  computeFn(3)
];

function computeFn(val) {
  return new Promise((resolve, reject) => {
    console.log(val);
    // I have used load() but this can be any async call
    load('http://exploringjs.com/es6/ch_promises.html', (err, resp, body) => {
      if (err) {
        return reject(err);
      }
      console.log("resolved")
      resolve(val);
    });
  });
}

myAsyncFuncs.reduce((prev, curr) => {
  console.log("returned one promise");
  return prev.then(curr);
}, Promise.resolve(0))
.then((result) => {
  console.log("At the end of everything");
})
.catch(err => {
  console.error(err);
});

Not so fast. As you can guess, it didn’t work out. This was the output I got -

1
2
3
returned one promise
returned one promise
returned one promise
At the end of everything
resolved
resolved
resolved

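The promises were all getting executed up front and didn’t wait for the previous promise to finish. What is going on ? After some more time, I found this (Advanced mistake #3: promises vs promise factories).

Aha ! So a promise starts executing immediately on instantiation; by the time the array was built, all three requests were already in flight. Worse, then() expects a function, and when it is handed a promise instead it silently ignores it, so the chain never waited for anything. All I had to do was delay the creation of each promise until the previous one had finished, by passing then() a function which returns the promise. bind to the rescue !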

'use strict';
const load = require('request');

let myAsyncFuncs = [
  computeFn.bind(null, 1),
  computeFn.bind(null, 2),
  computeFn.bind(null, 3)
];

function computeFn(val) {
  return new Promise((resolve, reject) => {
    console.log(val);
    // I have used load() but this can be any async call
    load('http://exploringjs.com/es6/ch_promises.html', (err, resp, body) => {
      if (err) {
        return reject(err);
      }
      console.log("resolved")
      resolve(val);
    });
  });
}

myAsyncFuncs.reduce((prev, curr) => {
  console.log("returned one promise");
  return prev.then(curr);
}, Promise.resolve(0))
.then((result) => {
  console.log("At the end of everything");
})
.catch(err => {
  console.error(err);
});

And now -

returned one promise
returned one promise
returned one promise
1
resolved
2
resolved
3
resolved
At the end of everything

Finally :)

Conclusion - If you want to execute promises in series, don’t create the promises up front, because they start executing right away. Pass functions which create them, so that each one starts only after the previous promise has finished.

How to smoothen contours in OpenCV

Disclaimer: I am in no way an expert in statistics, so many of the details are beyond me. This is just an explanation of my attempt to solve the problem I had.


Recently, I was working with some cool stuff in image processing. I had to extract some shapes after binarizing some images. The final task was to smoothen the contours extracted from the shapes to give it a better feel.

After researching a bit, the task was clear. All I had to do was resample the points in the contours at regular intervals and draw a spline through the control points. But OpenCV had no native function to do this, so I had to resort to NumPy and SciPy. Now, another problem was the data representation. Though the OpenCV Python bindings use NumPy arrays internally, you have to jump through a couple of hoops to get everything running smoothly.

Without wasting further time, here’s the code -

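Get the contours from the binary image -

import cv2

# img is assumed to be a single-channel (grayscale) image loaded earlier
ret, thresh_img = cv2.threshold(
    img,
    127,
    255,
    cv2.THRESH_BINARY_INV)
# Note: OpenCV 3.x returns three values here (image, contours, hierarchy)
contours, hierarchy = cv2.findContours(
    thresh_img,
    cv2.RETR_TREE,
    cv2.CHAIN_APPROX_SIMPLE)

Now comes the NumPy and SciPy code to smoothen each contour -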

import numpy
import cv2
from scipy.interpolate import splprep, splev

smoothened = []
for contour in contours:
    x,y = contour.T
    # Convert from numpy arrays to plain Python lists
    x = x.tolist()[0]
    y = y.tolist()[0]
    # https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.interpolate.splprep.html
    tck, u = splprep([x,y], u=None, s=1.0, per=1)
    # https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.linspace.html
    u_new = numpy.linspace(u.min(), u.max(), 25)
    # https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.interpolate.splev.html
    x_new, y_new = splev(u_new, tck, der=0)
    # Convert it back to numpy format for opencv to be able to display it
    res_array = [[[int(i[0]), int(i[1])]] for i in zip(x_new,y_new)]
    smoothened.append(numpy.asarray(res_array, dtype=numpy.int32))

# Overlay the smoothed contours on the original image
cv2.drawContours(original_img, smoothened, -1, (255,255,255), 2)

P.S.: Credit has to be given to this SO answer which served as the starting point.

As you can see, data conversion is required to pass to splprep. And then again, when you are appending to the list to overlay on the image.

Hope you found it useful. If you have a better way to achieve the same result, please do not hesitate to let me know in the comments !

Quick and Dirty intro to Debian packaging

Required background

I assume you have installed a Debian package at least once in your life. And you are reading this because you want to know how they are created or you want to actually create one.

Back story

Over my career as a software engineer, there were several times I had to create a debian package. I always managed to avoid learning how to actually create it by sometimes using company internal tools and sometimes fpm.

Recently, I had the opportunity to create a debian package to deploy a project for a client, and I decided to learn how debian packages were “actually” created - “the whole nine yards”. Well, this is an account of that adventure. :)

As usual, I looked through a couple of blog posts on the internet. But most of them had the same “man page” look and feel. And I absolutely dread man pages. Without getting discouraged, though, I decided to plough through. I came across this page which finally gave me some much needed clarity.

Into the real stuff !

So, these are the things that I wanted to happen when I did dpkg -i on my package -

  1. Put the source files inside a “/opt/<project-name>/” folder.
  2. Put an upstart script inside the “/etc/init/” folder.
  3. Put a cron job in “/etc/cron.d/” folder.

The command that you use to build the debian package is

$ dpkg-deb --build <folder-name>

The contents of that folder are where the magic is.

Let’s say that your folder is package. Inside package you need to have a folder DEBIAN. Next to it, you recreate the folder structure in which you want your files to end up on the target system. So in my case, I will have something like this -

$ tree -L 3 package/
package/
├── DEBIAN
│   ├── control
│   └── postinst
├── etc
│   ├── cron.d
│   │   └── cron-file
│   └── init
│       └── project_name.conf
└── opt
    └── <project-name>
        ├── main.js
        ├── folder1
        ├── node_modules
        ├── package.json
        ├── folder2
        └── helper.js

Consider the package folder to be the root(/). Don’t worry about the contents of the DEBIAN folder, we’ll come to that later.

After this, just run the command -

$ dpkg-deb --build package

Voila ! You have a debian package ready !

If you see any errors now, it’s probably related to the contents of the DEBIAN folder. So, let’s discuss them one by one.

  • control

If you just want to build the package and be done with it, you only need to have the control file. It’s kind of a package descriptor file with some fields that you need to fill in. Each field begins with a tag, followed by a colon and then the body of the field. The compulsory fields are Package, Version, Maintainer and Description.

Here’s how my control file looks -

Package: myPackage
Version: 1.0.0-1
Architecture: amd64
Depends: libcairo2-dev, libpango1.0-dev, libssl-dev, libjpeg62-dev, libgif-dev
Maintainer: Agniva De Sarker <agniva.quicksilver@gmail.com>
Description: Node js worker process to consume from the Meteor job queue
 The myPackage package consumes jobs submitted by users to the Meteor
 web application.

The Depends field helps you to specify the dependencies that your package might require to be pre-installed. Architecture is self-explanatory. (Small note on this - debian uses amd64 for 64 bit systems, not x86_64.)

For further info, see man 5 deb-control

  • preinst

If you want to run some sanity checks before the installation begins, you can have a shell script here. The important thing to note is that dpkg decides whether to go ahead with the installation based on the exit code of these scripts. So, you should write “set -e” at the top of your script, so that any failing command aborts it with a non-zero exit code. Don’t forget to make it executable.

  • postinst

This is executed after the package is installed. Same rules apply as before. This is how my postinst looks -

#!/bin/bash
set -e

#Move the bootstrap file to proper location
mv /opt/myPackage/packaging/bootstrap.prod /opt/myPackage/.bootstraprc

#Clear the DEBIAN folder
rm -rf /opt/myPackage/packaging/DEBIAN

  • prerm

Gets executed before removing the package.

  • postrm

Gets executed after removing the package. You usually want to execute clean up tasks in this script.

Taking a step further

As you can figure, this entire process can be easily automated and made a part of your build system. Just create the required parent folders and put the source code and config files at the right places. Also have the files of the DEBIAN folder stored somewhere in your repo, which you can copy to the target folder.

Since I had a Node project, I mapped it to "scripts":{"build": "<command_to_run>"} in my package.json file. You can do something similar for projects in other programming languages too.

TLDR

Just to recap quickly -

  1. Create a folder you will use to build the package.
  2. Put a DEBIAN folder inside it with the control file. Add more files depending on your need.
  3. Put the other files that you want to be placed in the filesystem after installation considering the folder as the root.
  4. Run dpkg-deb --build <folder-name>

Keep in mind, this is the bare minimum you need to create a debian package. Ideally, you would also want to add a copyright file, a changelog and a man page. There is a tool called lintian that you can use to follow the best practices around creating debian packages.

Hope this intro was helpful. As usual, comments and feedback are always appreciated !