HTTP Learning Notes 2_HTTPS

I. HTTP

HTTPS emerged to compensate for the following shortcomings of HTTP:

Communication content may be eavesdropped

HTTP communicates in plaintext (unencrypted), so content is easily eavesdropped. HTTP itself has no encryption functionality and cannot encrypt what it sends.

Lack of encryption would not matter much on a fully trusted network. Under TCP/IP, however, a packet may pass through many intermediate nodes, any node on the Internet may act as a relay, and every node on the path can read the packet, so encryption becomes essential.

Encryption does not stop eavesdropping itself: an eavesdropper can still capture the (encrypted) packet. What encryption does is make the captured packet worthless, as long as the eavesdropper cannot decrypt it within the period during which the information matters.

Identity may be impersonated

The HTTP protocol has no identity verification mechanism and does not confirm who the communicating parties are, so there may be "fake clients" and "fake servers". In addition, because the server processes requests from anyone, it cannot refuse meaningless requests, which leaves it open to DoS (Denial of Service) attacks.

This can be solved by having a trusted third-party institution provide digital certificates. Certificates can be used to indicate that the server is a "real server" and can also be used to indicate that the client is a "real client"; each party just needs to hold their respective certificates.

Certificates are useful but have not become widespread, mainly because certificates from third-party institutions cost money (from several thousand to over 100,000 RMB per certificate at commercial CAs; free CAs such as Let's Encrypt have since appeared for server certificates). For the server side, one certificate is enough, but issuing certificates to a large number of clients is very expensive and general applications cannot afford it, so only applications like online banking install client certificates.

P.S. What a certificate is and how it works is explained in detail below. There is also "self-signing", which needs no third party and is cheap per certificate, but running your own CA securely is beyond most ordinary enterprises.

Communication content may be tampered with

HTTP cannot prove the integrity of a packet and cannot tell whether the packet has been intercepted and tampered with on the way, so MITM (Man-in-the-Middle) attacks are possible.

Existing anti-tampering methods include verifying hash values such as MD5 or SHA-1, and using PGP (Pretty Good Privacy) to create digital signatures for files. None of them is convenient: the user must check manually whether tampering has occurred, because the browser cannot do it automatically. Moreover, even this cannot fully prevent tampering, because the published PGP signature or MD5 value may itself be rewritten in transit.
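The manual hash check described above can be sketched as follows. This is only an illustration: the file contents are made up, and SHA-256 is used in place of MD5/SHA-1 as an assumption of this sketch. As the text notes, the check is only as trustworthy as the channel that delivered the published digest.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, published_digest: str) -> bool:
    """Compare the digest of the received bytes with the published digest."""
    return hmac.compare_digest(sha256_hex(data), published_digest)


# Hypothetical download: the site publishes the digest next to the file.
original = b"installer-v1.0 contents"
published = sha256_hex(original)

# What an attacker on the path might deliver instead.
tampered = b"installer-v1.0 contents + malware"

print(is_unmodified(original, published))  # True
print(is_unmodified(tampered, published))  # False
```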

II. Digital Certificates and Hybrid Encryption

1. Hybrid Encryption

Before understanding HTTPS, it is necessary to first understand what digital certificates are, and before that, you should know 3 encryption methods:

Symmetric encryption (shared key encryption): The simplest encryption method, where both communicating parties hold the same key.

The sender encrypts with the key and sends the ciphertext to the receiver. The receiver decrypts the ciphertext with the same key to obtain the plaintext, and vice versa.

Disadvantage: The key cannot be securely delivered to the other party (in network communication, information security can only be guaranteed through encryption. If the key can be securely delivered, it means other information can also be securely delivered).

Asymmetric encryption (public key encryption): Keys are used in pairs, a public key and a private key (the public key is published, the private key is kept secret).

The receiver publishes its public key. The sender encrypts with that public key, and only the holder of the matching private key can decrypt the ciphertext to obtain the plaintext. For the reverse direction, the other party needs its own key pair: anything "encrypted" with a private key can be read by everyone who has the public key, so that operation provides no confidentiality and is used for digital signatures instead.

Disadvantage: Compared with symmetric encryption, asymmetric encryption and decryption processes have greater overhead (to guarantee that private key cannot be calculated from ciphertext + public key, complex algorithm support is needed).

Hybrid encryption (combining the two methods above): Use the asymmetric method to transmit the shared key, then use the symmetric method for the actual communication.

The symmetric method has low overhead, but its biggest problem is that the key cannot be delivered securely. The asymmetric method solves exactly that key sharing problem, which is why hybrid encryption exists.

First use the asymmetric method to transmit the shared key, so that the key is delivered securely. Then use symmetric encryption for communication, which avoids the overhead of encrypting all traffic asymmetrically.

HTTPS uses hybrid encryption.
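A toy sketch of the hybrid idea, under loud assumptions: the "asymmetric" part is textbook RSA with two small Mersenne primes and no padding, and the "symmetric" part is an XOR keystream derived from SHA-256. Neither is what real SSL/TLS uses and neither is secure; the point is only the order of operations (wrap the shared key asymmetrically, then encrypt traffic symmetrically).

```python
import hashlib
import secrets

# --- Toy asymmetric part: textbook RSA (illustrative only, not secure). ---
P, Q = 2**61 - 1, 2**89 - 1            # two Mersenne primes, far too small for real use
N = P * Q                              # public modulus
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (needs Python 3.8+)


def rsa_encrypt(m: int) -> int:
    """Encrypt with the public key (E, N)."""
    return pow(m, E, N)


def rsa_decrypt(c: int) -> int:
    """Decrypt with the private key (D, N)."""
    return pow(c, D, N)


# --- Toy symmetric part: XOR with a SHA-256-derived keystream. ---
def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data; applying it twice with the same key restores the input."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


# Client: pick a random shared key and send it wrapped with the server's public key.
shared_key = secrets.token_bytes(16)
wrapped_key = rsa_encrypt(int.from_bytes(shared_key, "big"))

# Server: unwrap the shared key with its private key.
recovered_key = rsa_decrypt(wrapped_key).to_bytes(16, "big")

# Both sides now use the cheap symmetric method for the actual traffic.
ciphertext = keystream_xor(shared_key, b"GET /account HTTP/1.1")
plaintext = keystream_xor(recovered_key, ciphertext)
print(plaintext)  # b'GET /account HTTP/1.1'
```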

2. Digital Certificates

A digital certificate is a public key, bundled with information about its owner and digitally signed by a certificate authority (CA). The certificate simultaneously proves the identity of the certificate issuer (the CA) and of the public key owner (server/client). The principle is as follows:

Background: Server S decides to encrypt communication to ensure information security when customers conduct business online.

Publish S's public key (tell client C)

How to publish is a problem, because after C receives the public key, it cannot determine whether the public key is real (cannot prove S's identity). What if someone maliciously publishes it?

If C receives only 1 fake public key, it can encrypt the request with that public key, and then S finds it cannot decrypt and rejects the request. Just send one more request to solve the problem. But what if C receives 100 fake public keys? Continue to verify one by one?

Of course not. At this time, it is necessary to borrow a third-party digital certificate authority.

S asks the digital certificate authority (CA) to vouch for its public key, proving that this public key is real (proving S's identity).

After the CA receives S's fee, it signs S's public key with the CA's own private key. The public key together with this signature is the digital certificate. The CA hands the certificate to S: from now on S just gives the certificate to C, and C will know S is real (P.S. "Certificate" is a very apt name. S goes to the CA for an ID card proving its identity, and the CA stamps it with a red seal. That is the "certificate").

Wait, how does C know at a glance that the certificate is real? What if the certificate itself is fake? How is the validity of the certificate proven?

To avoid such an endless chain of proof, browsers ship with the public keys of trusted CAs built in. When C gets the certificate from S, it uses the CA's public key to verify the signature on the certificate. Only the CA's private key could have produced a signature that verifies, so a valid signature shows that the certificate was issued by the CA and has not been altered, and C can trust the public key inside it as S's. This proves S's identity and, at the same time, that the certificate really comes from the CA.

Note that the browser's built-in CA public keys are free. Browsers have to build in the public keys of the major CAs to meet market needs, and CAs do not need to pay browser vendors. Why keep emphasizing money?

To make the point that certificates cost money and are not cheap. As mentioned earlier, one certificate for the server is affordable, but issuing client certificates to a large user base is very expensive, so only applications like online banking install client certificates.

After C obtains S's public key, it generates the shared key (a random string called the Pre-master secret), encrypts it with S's public key, and sends it to S.

S decrypts the received message with its private key to obtain the shared key, and then fast and secure communication can begin.
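The issue-and-verify part of the flow above can be sketched with a toy signature scheme. Everything here is an assumption made for illustration: the CA key is textbook RSA over two small Mersenne primes, the "certificate" is a plain dict, and the names `ca_issue` and `client_verify` are hypothetical. Real certificates use the X.509 format and padded signature schemes.

```python
import hashlib

# Toy CA key pair (textbook RSA, illustrative only, not secure).
CA_P, CA_Q = 2**61 - 1, 2**89 - 1
CA_N = CA_P * CA_Q                                  # CA public modulus (built into the "browser")
CA_E = 65537                                        # CA public exponent
CA_D = pow(CA_E, -1, (CA_P - 1) * (CA_Q - 1))       # CA private exponent (needs Python 3.8+)


def digest_as_int(subject: str, public_key: str) -> int:
    """Hash the certificate contents and reduce the hash into the RSA modulus range."""
    data = f"{subject}|{public_key}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % CA_N


def ca_issue(subject: str, public_key: str) -> dict:
    """CA side: sign the digest of (subject, public key) with the CA private key."""
    signature = pow(digest_as_int(subject, public_key), CA_D, CA_N)
    return {"subject": subject, "public_key": public_key, "signature": signature}


def client_verify(cert: dict) -> bool:
    """Client side: check the signature using only the CA public key."""
    expected = digest_as_int(cert["subject"], cert["public_key"])
    return pow(cert["signature"], CA_E, CA_N) == expected


cert = ca_issue("shop.example", "S-public-key-bytes")
print(client_verify(cert))    # True

# An attacker swaps in their own key but cannot produce a matching signature.
forged = dict(cert, public_key="attacker-public-key")
print(client_verify(forged))  # False
```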

The above example is a server certificate, used to prove the server's identity. Similarly, there can also be client certificates used to prove the client's identity, such as online banking U-shields.

Because certificates are expensive, only things like online banking have client certificates.

3. Self-Signing

Certificates from well-known CAs are relatively expensive, so some enterprises decide to issue their own certificates (act as their own CA). This is why, when browsing, the browser sometimes pops up a warning that the site's certificate is untrusted and asks whether to continue. The reason is that the browser does not have the issuing CA's public key built in, so it sees the certificate but cannot verify whether it is real or fake.

So for public websites self-signing does little to prove identity: traffic is still encrypted, but the visitor cannot tell the real server from an impostor, unless your CA becomes well known and its public key is included in browsers.

The benefit of self-signing is that certificates are very cheap: once you have the infrastructure in place, you can issue as many certificates as you like and distribute them freely.

But being your own CA is not that simple. A CA needs very reliable security protection, so it is reasonable for well-known CAs to charge for certificates to fund the security of their issuance systems.
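The "untrusted certificate" behavior is not specific to browsers. As a small illustration, Python's standard `ssl` module also refuses certificates that do not chain to a CA it already trusts. The helper `trust_private_ca` and its `ca_file` argument are hypothetical names for this sketch; it shows how a client you control could be told to trust your own CA in addition to the defaults.

```python
import ssl

# A default client context behaves like a browser: it only accepts
# certificates signed by a CA in its trust store.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True


def trust_private_ca(ca_file: str) -> ssl.SSLContext:
    """Return a client context that also trusts the CA certificate stored in ca_file.

    ca_file is a hypothetical path to your own CA certificate in PEM format.
    """
    private_ctx = ssl.create_default_context()
    private_ctx.load_verify_locations(cafile=ca_file)
    return private_ctx
```

This only works for clients you can configure yourself, which is exactly why self-signing does not help with anonymous visitors on the public Web.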

III. HTTPS

1. What is HTTPS?

HTTPS = HTTP + Encryption + Authentication + Integrity Protection

      = HTTP + Encryption (implemented by SSL/TLS) + Authentication (implemented by digital certificates) + Integrity Protection (implemented by the MAC that SSL/TLS attaches to each message)

In terms of protocol layers:

Plain HTTP stack: HTTP over TCP over IP

HTTPS stack: HTTP over SSL over TCP over IP

P.S. TLS is the successor protocol developed from SSL. The two are often collectively called SSL, which is why TLS does not appear separately in the last line.
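The layering can be seen directly in code. In this sketch (function names are my own, and the host passed in is up to the caller), the HTTP request bytes are exactly what plain HTTP would send; the only difference is that the TCP socket is wrapped in an SSL/TLS layer before the bytes are written.

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """Build plain HTTP/1.1 request bytes; these are identical for HTTP and HTTPS."""
    return (
        f"HEAD {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def https_status_line(host: str, port: int = 443) -> str:
    """Send the request over TCP wrapped in SSL/TLS and return the response status line."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as tcp:    # TCP/IP layer
        with context.wrap_socket(tcp, server_hostname=host) as tls:    # SSL/TLS layer
            tls.sendall(build_request(host))                           # HTTP layer
            response = tls.recv(4096)
    return response.split(b"\r\n", 1)[0].decode("ascii", "replace")
```

Calling `https_status_line("example.com")` requires network access, so no output is shown here.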

Note:

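HTTPS is slower than HTTP, by anywhere from 2 to 100 times in unfavorable cases. The actual gap depends heavily on hardware and configuration.

There are two reasons. First, communication itself is slower: the SSL handshake adds round trips before any HTTP data flows. Second, processing is slower: encryption and decryption consume CPU and memory on both the server and the client.

There is no fundamental solution, but dedicated hardware such as SSL accelerators can be used to speed up SSL processing.

HTTPS is not always used for all communication.

A common practice has been to use HTTPS only when transmitting sensitive information and plain HTTP at other times, to improve communication speed and reduce overhead. The trade-off is that everything sent over HTTP keeps all the weaknesses listed in section I.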

2. Encryption Methods

Communication encryption

HTTP has no encryption mechanism, so SSL (Secure Sockets Layer) and TLS (Transport Layer Security) are introduced to encrypt communication.

After establishing a secure communication line with SSL, HTTP communication is performed on this line. This method is called HTTPS (HTTP Secure, Hypertext Transfer Protocol Secure) or HTTP over SSL.

Content encryption

Encrypt only the packet body. Of course, this requires cooperation from both the client and the server (both must support the encryption/decryption functions).

Moreover, encrypting only the content is not safe: the packet header is not encrypted, so attacks that target headers, such as HTTP header injection, still work.

3. Identity Authentication Methods

BASIC authentication

The server returns a 401 response requesting authentication. The client joins the username and password with a colon, encodes the result with Base64, and sends it to the server. The server verifies the identity based on this string. On success it returns 200; otherwise it returns 401 again.

Base64 encoding is not encryption, so this is no different from sending the password in plaintext. Over plain HTTP it is extremely insecure, so it is not commonly used.
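A short sketch of why Base64 offers no protection. The username and password are made-up values; the header format (`Authorization: Basic` followed by the Base64 of `username:password`) is the standard one for BASIC authentication.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header a client sends for BASIC authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"


header = basic_auth_header("alice", "s3cret")
print(header)

# Anyone who sees the header can reverse it; no key is involved.
token = header.rsplit(" ", 1)[1]
print(base64.b64decode(token).decode("utf-8"))  # alice:s3cret
```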
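
DIGEST authentication

Since BASIC authentication sends the password in an easily reversed Base64 form, DIGEST authentication avoids sending the password at all: the server issues a random challenge (nonce), and the client replies with an MD5 hash computed from the username, password, nonce, and request details. MD5 here is a one-way hash, not encryption.

It keeps the password itself from being eavesdropped but cannot prevent identity impersonation, and is not commonly used.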
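A minimal sketch of how a DIGEST response value is computed, assuming the simplest variant (MD5, no optional `qop` field). The username, realm, password, and nonces below are made-up values. Both sides compute the same value, so the server can check the client without the password ever crossing the wire.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of text as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username: str, realm: str, password: str,
                    method: str, uri: str, nonce: str) -> str:
    """Compute the response value for DIGEST authentication without qop."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")   # identifies the user
    ha2 = md5_hex(f"{method}:{uri}")                  # identifies the request
    return md5_hex(f"{ha1}:{nonce}:{ha2}")            # bound to the server's nonce


# The server sends a fresh nonce with each 401 challenge.
first = digest_response("alice", "example", "s3cret", "GET", "/private", "nonce-001")
second = digest_response("alice", "example", "s3cret", "GET", "/private", "nonce-002")

# A captured response is useless for a later challenge with a different nonce.
print(first != second)  # True
```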

SSL client authentication

Using certificates can prevent identity impersonation. The client sends its client certificate to the server, the server verifies the certificate and extracts the client's public key, and then HTTPS communication starts.

But client certificates also cost money, so this is not commonly used either.
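
Form-based authentication

The most commonly used method is form-based authentication. Its security depends entirely on the Web application.

On the server, passwords are generally stored as salted hashes; MD5 + salt (salted MD5) is the classic scheme. MD5 is fast to compute and is now considered too weak for this purpose, so a deliberately slow function such as PBKDF2, bcrypt, or scrypt is the better choice, but the salting idea is the same. It works as follows: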

The salt is a random string generated by the server and stored in the server database.

When a new user registers, generate a salt for that user ID, concatenate the salt with the user's plaintext password, and hash the concatenation. The result is the password string stored for that user in the user table.

When logging in, look up the salt for the user, concatenate it with the submitted plaintext password, and hash the result. Then compare that value with the password string stored in the user table to verify identity.

Salting defeats lookup-table attacks. Unsalted MD5 hashes can be cracked quickly with rainbow tables. With a per-user salt, a precomputed table no longer applies, and users with the same plaintext password end up with different password strings in the user table.

There is a very good article about MD5 + salt. Please see WooYun: The Correct Way to Save Passwords with Salted Hash.
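The register/login flow described above can be sketched as follows. Note the swap: this sketch uses PBKDF2-HMAC-SHA256 from the standard library instead of MD5, which is my assumption rather than what the notes describe. The in-memory `user_table` dict stands in for the database, and the iteration count is an arbitrary value chosen for illustration.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000   # assumed work factor for this sketch

user_table = {}        # user_id -> (salt, password_hash); stands in for the database


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive the stored password string from the plaintext password and the salt."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def register(user_id: str, password: str) -> None:
    """Generate a fresh random salt for the user and store salt + hash."""
    salt = secrets.token_bytes(16)
    user_table[user_id] = (salt, hash_password(password, salt))


def login(user_id: str, password: str) -> bool:
    """Recompute the hash with the stored salt and compare it with the stored value."""
    if user_id not in user_table:
        return False
    salt, stored = user_table[user_id]
    return hmac.compare_digest(hash_password(password, salt), stored)


register("alice", "hunter2")
register("bob", "hunter2")

print(login("alice", "hunter2"))   # True
print(login("alice", "wrong"))     # False
# Same plaintext password, different salts, different stored strings.
print(user_table["alice"][1] != user_table["bob"][1])  # True
```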

IV. HTTPS Communication Process

The client starts SSL communication by sending a Client Hello packet.

The packet contains the SSL version supported by the client and a list of cipher suites (the supported encryption algorithms, key lengths, and so on).

When the server can perform SSL communication, it responds with a Server Hello packet.

Like the Client Hello, this packet contains the SSL version and a cipher suite. The server selects its cipher suite from the list the client offered.
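The selection in the Server Hello can be sketched as a simple intersection. This is an assumption-laden illustration: the function name is hypothetical, the suite lists are examples, and honoring the client's preference order is just one possible policy (servers may prefer their own order instead).

```python
def choose_cipher_suite(client_suites, server_supported):
    """Return the first client-offered suite the server also supports, or None."""
    for suite in client_suites:
        if suite in server_supported:
            return suite
    return None


# Example lists; the client offers suites in its order of preference.
client_hello_suites = [
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_RSA_WITH_AES_128_CBC_SHA",
    "TLS_RSA_WITH_3DES_EDE_CBC_SHA",
]
server_supported = {
    "TLS_RSA_WITH_AES_128_CBC_SHA",
    "TLS_RSA_WITH_3DES_EDE_CBC_SHA",
}

print(choose_cipher_suite(client_hello_suites, server_supported))
# TLS_RSA_WITH_AES_128_CBC_SHA

# No common suite: the handshake cannot continue.
print(choose_cipher_suite(["TLS_RSA_WITH_RC4_128_MD5"], server_supported))  # None
```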

The server sends a Certificate packet.

The packet contains the public key certificate.

The server sends a Server Hello Done packet to notify the client.

This marks the end of the initial SSL handshake negotiation.

After this first phase of the handshake ends, the client responds with a Client Key Exchange packet.

The packet contains a random string called the Pre-master secret, which is used for communication encryption. The packet is encrypted with the public key from the server's Certificate packet.

The client then sends a Change Cipher Spec packet.

This packet tells the server that everything sent after it will be encrypted with keys derived from the Pre-master secret.

The client sends a Finished packet.

This packet contains a check value computed over all handshake messages exchanged so far. The handshake succeeds only if the server can correctly decrypt this packet.

The server similarly sends a Change Cipher Spec packet.

The server similarly sends a Finished packet.

After the Finished packets have been exchanged between the server and client, the SSL connection is considered established.

From this point on, communication is protected by SSL and application layer protocol communication begins, that is, sending HTTP requests.

Application layer protocol communication.

Send HTTP request/response, with MAC (Message Authentication Code) packet digest appended. Checking the MAC can determine whether the packet has been tampered with to protect packet integrity.
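The MAC check can be illustrated with HMAC from the standard library. This is a simplified sketch: the key and messages are made up, HMAC-SHA256 is an assumed choice, and real SSL/TLS records also cover sequence numbers and other fields in the MAC.

```python
import hashlib
import hmac


def compute_mac(mac_key: bytes, message: bytes) -> bytes:
    """Compute the MAC that the sender appends to the message."""
    return hmac.new(mac_key, message, hashlib.sha256).digest()


def verify_mac(mac_key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the MAC on the receiving side and compare it with the received tag."""
    return hmac.compare_digest(compute_mac(mac_key, message), tag)


# Both sides hold this key after the handshake (made-up value here).
mac_key = b"key-agreed-during-handshake"

message = b"HTTP/1.1 200 OK\r\n\r\nbalance=100"
tag = compute_mac(mac_key, message)

print(verify_mac(mac_key, message, tag))                                   # True
print(verify_mac(mac_key, b"HTTP/1.1 200 OK\r\n\r\nbalance=999", tag))     # False
```

Unlike a bare hash, the MAC depends on a secret key, so an attacker who changes the message cannot simply recompute a matching tag.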

Finally, the client closes the connection.

It sends a close_notify packet.

The TCP connection is closed.

A TCP FIN packet is sent to close TCP communication.

References

"Illustrated HTTP"