Transferring files between two or more machines is an essential part of the ETL (extract, transform, load) process. Of course, there are multiple ways to move data, including data held in flat file databases. For example, you can physically copy the data onto a USB drive or send it to the recipient via email.
But methods like these are far less efficient than sending data via FTP. So what is FTP exactly, and how do you use it to transfer files and data? Keep reading for all the answers.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
What is File Transfer Protocol (FTP)?
File Transfer Protocol is a network protocol for transferring, accessing, and managing files on a remote computer. Because FTP was designed specifically for file transfer, it is often a better fit for this task than general-purpose protocols such as HTTP, particularly when moving large files. Using FTP, you can send and receive data between any two devices across a TCP/IP network.
The Network Working Group's RFC 959 specification formally defines FTP. This document includes definitions of the FTP model, the data types FTP supports, and the FTP commands and modes.
In every FTP file transfer, there are two different machines involved: the FTP server and the FTP client. The server listens for incoming connection requests, while the client initiates an FTP session with the server.
Depending on your preference, you can start an FTP connection using the command line or via open-source FTP software such as Cyberduck (for the Windows and macOS operating systems) or FileZilla (for Windows, macOS, and Linux). Web browsers such as Google Chrome and Firefox formerly supported FTP connections, but both have since removed that support.
When connecting to the FTP server, the FTP client may need to undergo authentication (e.g. with a username and password or a security certificate). In anonymous FTP, however, users can connect to the FTP server without an account.
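As a minimal sketch of such a client session, Python's standard-library ftplib module can log in (anonymously or with credentials) and download a file. The helper names open_session and download_bytes are hypothetical, and the host and file name in the usage comment are placeholders rather than real endpoints.

```python
import io
from ftplib import FTP


def open_session(host, user=None, password=None):
    """Connect to `host` on the default control port (21) and log in.

    When no user is given, ftplib performs an anonymous login.
    """
    ftp = FTP(host, timeout=30)
    if user is None:
        ftp.login()  # anonymous FTP
    else:
        ftp.login(user=user, passwd=password or "")
    return ftp


def download_bytes(ftp, remote_name):
    """Download one file over an already-open session and return its bytes.

    `ftp` is expected to behave like ftplib.FTP (it needs `retrbinary`).
    """
    buffer = io.BytesIO()
    # RETR asks the server to send the file; each chunk is passed to buffer.write.
    ftp.retrbinary(f"RETR {remote_name}", buffer.write)
    return buffer.getvalue()


# Example usage (placeholder host and file name):
#   ftp = open_session("ftp.example.com")
#   data = download_bytes(ftp, "report.csv")
#   ftp.quit()
```

Passing the session object into download_bytes keeps the download logic separate from the connection details, so the same helper works for anonymous and authenticated sessions.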
Firewalls like Windows Firewall will often block FTP connections by default. You need to allow TCP traffic on ports 20 and 21 to initiate an FTP connection. There are two different ports because FTP actually makes use of two connections: a control connection that carries commands and replies between the client and the server, and a data connection that transfers the data itself. Port 20 is for the data channel, while port 21 is for the control channel.
More specifically, the data channel is set to port 20 when operating in active FTP. An FTP server can run in either active mode or passive mode:
In active FTP, the data connection will use port 20 (unless set to another port). The FTP client sends the PORT command to the server, which specifies the port on the client the server should connect to. The server then actively initiates the connection to the client.
In passive FTP, the data connection uses a randomly chosen port on the server. The client sends the PASV command to the server, which acts as a request for a port number on the server. The server then responds with a port number that it has opened for data transfer. Finally, the client (not the server) starts the connection to the server.
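To make the two modes concrete, here is a small sketch of how the address and port travel over the control connection. RFC 959 writes them as six comma-separated numbers, h1,h2,h3,h4,p1,p2, where the port equals p1 * 256 + p2. The function names and the sample addresses are illustrative assumptions.

```python
import re


def encode_port_argument(ip, port):
    """Build the argument of the PORT command (active mode): h1,h2,h3,h4,p1,p2."""
    high, low = divmod(port, 256)
    return ",".join(ip.split(".") + [str(high), str(low)])


def parse_pasv_reply(reply):
    """Extract (ip, port) from the server's 227 reply to PASV (passive mode)."""
    match = re.search(r"(\d+),(\d+),(\d+),(\d+),(\d+),(\d+)", reply)
    if match is None:
        raise ValueError(f"not a valid PASV reply: {reply!r}")
    numbers = [int(n) for n in match.groups()]
    ip = ".".join(str(n) for n in numbers[:4])
    port = numbers[4] * 256 + numbers[5]
    return ip, port


# Active mode: the client tells the server where to connect.
print("PORT " + encode_port_argument("192.168.1.10", 50000))
# prints: PORT 192,168,1,10,195,80

# Passive mode: the server tells the client where to connect.
print(parse_pasv_reply("227 Entering Passive Mode (192,168,1,20,195,149)"))
# prints: ('192.168.1.20', 50069)
```

In practice a client library handles this exchange for you. Python's ftplib, for example, uses passive mode by default, and calling set_pasv(False) on an FTP object switches it to active mode.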
Because FTP transfers data, including login credentials, in plaintext, it is not appropriate for many purposes (e.g. secure file transfer of personal or sensitive data). For use cases such as these, you should instead use SSH File Transfer Protocol (SFTP), which runs over the Secure Shell (SSH) cryptographic protocol and encrypts the data it sends. Another secure FTP solution is FTPS (FTP-SSL or FTP Secure), which extends FTP to add support for the Transport Layer Security (TLS) cryptographic protocol.
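For FTPS, a hedged sketch using the FTP_TLS class from Python's standard-library ftplib module might look like the following. The helper name secure_session is hypothetical, and the host and credentials in the usage comment are placeholders.

```python
from ftplib import FTP_TLS


def secure_session(ftps, user, password):
    """Log in over TLS and switch the data channel to encrypted mode.

    `ftps` is expected to behave like ftplib.FTP_TLS.
    """
    # FTP_TLS.login() secures the control connection before sending credentials.
    ftps.login(user=user, passwd=password)
    # PROT P: also encrypt the data connection, which is not on by default.
    ftps.prot_p()
    return ftps


# Example usage (placeholder host and credentials):
#   ftps = secure_session(FTP_TLS("ftp.example.com", timeout=30), "alice", "s3cret")
#   ftps.retrlines("LIST")
#   ftps.quit()
```

The call to prot_p() matters: without it, only the control connection is protected and file contents would still cross the network unencrypted. SFTP is a separate protocol that ftplib does not cover, so it would need a third-party SSH library.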
FTP and ETL
Because FTP is so efficient at transferring data and files, it's an excellent addition to your ETL workflow. ETL is a data integration practice that describes the three steps of collecting and centralizing your enterprise data: information is first extracted from one or more sources, then cleaned and transformed to fit the target schema, and finally loaded into a target data warehouse. You can use ETL automation tools to upload and download FTP data at regular intervals, according to your business requirements.
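The transform and load steps can be sketched in a few lines of Python. Everything here is an illustrative assumption: the sales table, its name and amount columns, and the sample bytes, which stand in for a CSV file that the extract step would download from an FTP server. An in-memory SQLite database plays the role of the data warehouse.

```python
import csv
import io
import sqlite3


def transform(raw_bytes):
    """Parse CSV bytes and clean each row to fit the target schema (name TEXT, amount REAL)."""
    reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))
    rows = []
    for record in reader:
        name = record["name"].strip().title()  # normalize whitespace and casing
        amount = float(record["amount"])       # enforce a numeric type
        rows.append((name, amount))
    return rows


def load(connection, rows):
    """Insert the cleaned rows into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    connection.commit()


# Stand-in for bytes that the extract step would fetch over FTP.
extracted = b"name,amount\n alice ,10.5\nBOB,4\n"

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extracted))
print(warehouse.execute("SELECT name, amount FROM sales ORDER BY name").fetchall())
# prints: [('Alice', 10.5), ('Bob', 4.0)]
```

A scheduler or ETL tool would then run these steps at whatever interval the business requires.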
As discussed above, you should use secure alternatives to FTP, such as SFTP and FTPS, when you're transferring sensitive and confidential data. In particular, SFTP can help your organization remain in compliance with data security and privacy regulations such as GDPR, CCPA, and HIPAA.
Beyond secure protocols, encrypting the data itself before it is sent can offer another layer of protection when uploading and downloading data via FTP. The sender and the recipient will need to exchange the appropriate encryption and decryption keys in advance.
How Integrate.io Can Help with FTP
If you're looking to bring FTP into your own ETL workflow, we can help. Integrate.io is a powerful ETL and data integration platform that comes with more than 100 pre-built integrations. We include full support for the FTPS and SFTP protocols, so you can securely transfer files during the ETL process.
Ready to use Integrate.io and FTP? Get in touch with our team of data integration experts today for a chat about your business needs or to start your 14-day pilot of the Integrate.io platform.